Toward Packet Routing with Fully-distributed Multi-agent Deep Reinforcement Learning

Xinyu You, Xuanjie Li, Yuedong Xu*, Hui Feng
Research Center of Smart Networks and Systems, School of Information Science and Technology, Fudan University, Shanghai, China
{ xyyou18, xuanjieli16, ydxu, hfeng }

Jin Zhao
School of Computer Science, Fudan University, Shanghai, China

Abstract—Packet routing is one of the fundamental problems in computer networks, in which a router determines the next hop of each packet in the queue to get it to its destination as quickly as possible. Reinforcement learning has been introduced to design an autonomous packet routing policy, namely Q-routing, using only local information available to each router. However, the curse of dimensionality of Q-routing prohibits a more comprehensive representation of dynamic network states, thus limiting the potential benefit of reinforcement learning. Inspired by the recent success of deep reinforcement learning (DRL), we embed deep neural networks in multi-agent Q-routing. Each router possesses an independent neural network that is trained without communicating with its neighbors and makes decisions locally. Two multi-agent DRL-enabled routing algorithms are proposed: one simply replaces the Q-table of vanilla Q-routing by a deep neural network, and the other further employs extra information including the past actions and the destinations of non-head-of-line packets. Our simulation manifests that the direct substitution of the Q-table by a deep neural network may not yield minimal delivery delays because the neural network does not learn more from the same input. When more information is utilized, the adaptive routing policy can converge and significantly reduce the packet delivery time.

I. INTRODUCTION

Packet routing is a very challenging problem in distributed and autonomous computer networks, especially in wireless networks in the absence of centralized or coordinated service providers.
Each router decides to which neighbour it should send its packet in order to minimize the delivery time. The primary feature of packet routing resides in its fine-grained per-packet forwarding policy. No information regarding the network traffic is shared between neighbouring nodes. In contrast, existing protocols use flooding approaches either to maintain a globally consistent routing table (e.g. DSDV [10]), or to construct an on-demand flow-level routing table (e.g. AODV [9]). Packet routing is essential to meet the dynamically changing traffic pattern in today's communication networks. Meanwhile, it symbolizes the difficulty of designing a fully distributed forwarding policy that strikes a balance between choosing short paths and less congested paths through learning with local observations.

This work is partially supported by Natural Science Foundation of China (No. 61772139), the National Key Research and Development Program of China (No. 213), Shanghai-Hong Kong Collaborative Project under Grant 18510760900 and CERNET Innovation Project NGII20170209.

Reinforcement learning (RL) is a bio-inspired machine learning approach that acquires knowledge by exploring the interaction with the local environment without the need of external supervision [1]. Therefore, it is suitable to address the routing challenge in distributed networks, where each node (interchangeable with router) measures the per-hop delivery delays as the reward of its actions and learns the best action accordingly.

Authors in [5] proposed the first multi-agent Q-learning approach for packet routing in a generalized network topology. This straightforward routing policy achieves much smaller mean delivery delay compared with the benchmark shortest-path approach. Xia et al. [29] applied a dual RL-based Q-routing approach to improve the convergence rate of routing in cognitive radio networks. Lin and Schaar [3] adopted a joint Q-routing and power control policy for delay-sensitive applications in wireless networks.
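To make the tabular baseline concrete, the Q-routing rule of [5] can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function and variable names are ours. Each node keeps an estimate Q[d][y] of the remaining delivery time to destination d when forwarding via neighbor y, and nudges it toward the observed queueing time plus transmission time plus the neighbor's own reported best estimate.

```python
# Tabular Q-routing update (a sketch of the rule used in [5]; names are illustrative).
# Q[d][y] estimates the remaining delivery time to destination d via neighbor y.

def q_routing_update(Q, d, y, q_time, l_time, t_neighbor, alpha=0.5):
    """Update this node's estimate after forwarding a packet bound for d to neighbor y.

    q_time     -- time the packet spent queued at this node
    l_time     -- transmission latency to neighbor y
    t_neighbor -- neighbor y's best remaining estimate, min over z of Q_y[d][z],
                  reported back as feedback
    alpha      -- learning rate
    """
    target = q_time + l_time + t_neighbor      # observed one-hop cost plus tail estimate
    Q[d][y] += alpha * (target - Q[d][y])      # move the estimate toward the target
    return Q[d][y]
```

Under this rule the estimate shrinks when the neighbor reports a shorter tail and grows when queueing delay rises, which is exactly the congestion signal the paper later feeds into a neural network instead of a table.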
More applications of RL-based routing algorithms can be found in [11]. Owing to the well-known "curse of dimensionality" [14], the state-action space of RL is usually small, such that the existing RL-based routing algorithms cannot take full advantage of the history of network traffic dynamics and cannot explore sufficiently many trajectories before deciding the packet forwarding. The complexity of training RL with a large state-action space becomes an obstacle to deploying RL-based packet routing.

The breakthrough of deep reinforcement learning (DRL) provides a new opportunity to a good many RL-based networking applications that were previously perplexed by the prohibitive training burden. With a deep neural network (DNN) as a powerful approximator of the Q-table, the network designer can leverage its advantages from two aspects: (1) the neural network can take much more information as its inputs, enlarging the state-action space for better policy making; (2) the neural network can automatically abstract invisible features from high-dimensional input data [17], thus achieving end-to-end decision making while alleviating handcrafted feature selection. Recent successful applications include cloud resource allocation [20], adaptive bitrate video streaming [21], and cellular scheduling [22]. DRL is even used to generate a routing policy in [23] against the dynamic traffic pattern that is hardly predictable. However, authors in [23] consider a centralized routing policy that requires the global topology and the global traffic demand matrix, and operates at the flow level.

arXiv:1905.03494v1 [cs.NI] 9 May 2019

Inspired by the power of DRL and in view of the limitations of Q-routing [5], we aim to make an early attempt to develop fully-distributed packet routing policies using multi-agent deep reinforcement learning.

In this paper, we propose two multi-agent DRL routing algorithms for fully distributed packet routing. One simply replaces the Q-table of vanilla Q-routing [5] by a carefully designed neural network (Deep Q-routing, DQR). The input information, i.e. the destination of the head-of-line (HOL) packet in the queue, remains unchanged except for being one-hot encoded. The other introduces extra information as the input of the neural network, consisting of the action history and the destinations of future packets (Deep Q-routing with extra information, DQR-EI). We conjecture that the action history is closely related to the congestion of next hops, the number of packets in the queue indicates the load of the current router, and knowing the destinations of the coming outgoing packets avoids pumping them into the same adjacent routers. With such a large input space, Q-routing [5] cannot handle the training online, and the training of deep neural networks using RL rewards becomes necessary. DQR-EI is fully distributed in the sense that each router is configured with an independent neural network and has no knowledge about the queues and the DNN parameters of neighbouring routers. This differs from the recent multi-agent DRL learning framework in other domains [6], where the training of neural networks is simultaneous and globally consistent. The training of multi-agent DRL is usually difficult (e.g.
convergence and training speed), while DQR and DQR-EI prove the feasibility of deploying DRL-based packet routing in a dynamic environment.

Our experimental results reveal two interesting observations. Firstly, simply replacing Q-tables by DNNs offers delivery delay comparable with the original Q-routing. The different representations of the same input implicitly yield almost the same Markov decision process (MDP) policy. Secondly, DQR-EI significantly outperforms DQR and Q-routing in terms of the average delivery delay when the traffic load is high. After examining the routing policy of DQR-EI, we observe that each router makes adaptive routing decisions by considering more information than the destination of the HOL packet, thus avoiding congestion on "popular" paths.

The remainder of this paper is organized as follows: Section II reviews the background knowledge of RL and DRL. Section III presents our design of DQR and DQR-EI. The delivery delay of the proposed algorithms is evaluated in Section IV with Q-routing as the benchmark. Section V discusses future study and challenges. Section VI concludes this work.

II. BACKGROUND AND LITERATURE

In this section, we briefly review RL and DRL techniques and their applications to the routing problem, and then put forward the necessity of fully-distributed learning for the real-world routing problem.

A. RL algorithm

Based on the mapping relationship between observed state and executed action, RL aims to construct an agent that maximizes the expected discounted reward through interaction with the environment.
Without prior knowledge of which state the environment would transition to or which actions yield the most reward, the learner must discover the optimal policy by trial-and-error.

The first attempt to apply RL to the packet routing problem is the Q-routing algorithm, which is a variant of Q-learning [1]. Since Q-routing is essentially based on a multi-agent approach, each node is viewed as an independent agent and endowed with a Q-table that stores Q-values as estimates of the transmission time between that node and the others. With the aim of shortening the average packet delivery time, agents update their Q-tables and learn the optimal routing policy through the feedback from their neighboring nodes when receiving the packets sent to them. Despite the superior performance over the shortest-path algorithm in a dynamic network environment, Q-routing suffers from the inability to fine-tune the routing policy under heavy network load and from inadequate adaptability to network load changes. To address these problems, other improved algorithms have been proposed, such as PQ-routing [7], which uses previous routing memory to predict the traffic trend, and DRQ-routing [8], which utilizes the information from both forward and backward exploration to make better decisions.

B. DRL algorithm

DRL brings the advantage of deep neural networks to the training process, thereby improving the learning speed and the performance of RL [4]. One popular DRL algorithm is Deep Q-Learning (DQL) [24], which implements a Deep Q-Network (DQN) instead of a Q-table to derive an approximation of the Q-value, with the special mechanisms of experience replay and a target Q-network.

Recently, network routing problems with different environments and optimization targets have been solved with DRL. Based on the control model of the agent, these algorithms can be categorized as follows:

Class 1: Single-agent learning.
A single-agent algorithm treats the network controller as a central agent which can observe the global information of the network and control the packet scheduling strategy of every router. Both the learning and execution processes of this kind of algorithm are centralized [25]; in other words, the communication between routers is not restricted during training and execution.

With the single-agent approach, SDN-Routing [23] presents the first attempt to apply DRL to the routing optimization of traffic engineering. Viewing the traffic demand, which represents the bandwidth request between each source-destination pair, as the environment state, the network controller determines the transmission path of packets to achieve the objective of minimizing the network delay. Another algorithm [19] considers a similar network model while taking minimum link utilization as the optimization target.

Class 2: Multi-agent learning.

In multi-agent learning, every router in the network is treated as a single agent which can observe only the local environment information and take actions according to its own routing policy.

The first multi-agent DRL algorithm applied to the routing problem is DQN-routing [6], which combines Q-routing and DQN. Each router is regarded as an agent whose parameters are shared with the others and updated at the same time during the training process (centralized training), but which provides independent instructions for packet transmission (decentralized execution). The comparison with contemporary routing algorithms in online tests confirms a substantial performance gain.

C. Fully-distributed learning

Algorithms with the centralized learning process stated above are not applicable in real computer networks.
The centralized learning controller is usually unable to gather the collected environment transitions from widely distributed routers once an action is executed somewhere, or to update the parameters of each neural network simultaneously, because of the limited bandwidth.

Accordingly, for better application in real-world scenarios, the routing algorithms we propose are based on fully-distributed learning, which means both the training process and the execution process are decentralized. Under these circumstances, every agent owns its unique neural network with independent parameters for policy update and decision making, thereby avoiding the necessity for communications among routers in the process of environment transition collection and parameter update.

III. DESIGN

We establish the mathematical model of the packet routing problem and describe the representation of each element in the reinforcement learning formulation. Then we put forward two different deep neural network architectures substituting the original Q-table and propose the corresponding training algorithm.

A. Mathematical Model

Network. The network is modeled as a directed graph G = (N, E), where N and E are defined as finite sets of router nodes and transmission links between them, respectively. A simple network topology can be found in Fig. 1, containing five nodes and five pairs of bidirectional links. Every packet is originated from node s and destined for node d: s, d ∈ N and s ≠ d, with randomly generated intervals.

Fig. 1. 5-node network topology.

Routing. The mission of packet routing is to transfer each packet to its destination through the relaying of multiple routers. The queue of each router follows the first-in first-out (FIFO) criterion. Each router n constantly delivers the packet in the head of line to its neighbor node v until that packet reaches its destination.

Target.
The packet routing problem aims at finding the optimal transmission path between source and destination nodes based on some routing metric, which, in our experiment, is defined as the average delivery time of packets. Formally, we denote the packet set as P and the total transmission time as t_p for every packet p ∈ P. Our target is to minimize the average delivery time T = Σ_{p∈P} t_p / K, where K denotes the number of packets in P.

B. Reinforcement Learning Formulation

Packet routing can be modeled as a multi-agent reinforcement learning problem with partially observable Markov decision processes (POMDPs) [28], where each node is an independent agent which can observe the local network state and make its own decisions according to an individual routing policy. Therefore, we illustrate the definitions of each element in reinforcement learning for a single agent.

State space. The packet p to be sent by agent n is defined as the current packet. We denote the state space of agent n as S_n: {d_p, E_n}, where d_p is the destination of the current packet and E_n, which may be empty, is some extra information related to agent n. At different time steps, the state observed by the agent is time-varying due to the dynamic change of network traffic.

Action space. The action space of agent n is defined as A_n: V_n, where V_n is the set of neighbor nodes of node n. Accordingly, for every agent, the size of the action space equals the number of its adjacent nodes; e.g., each node in Fig. 1 has two candidate actions. Once a packet arrives at the head of the queue at time step t, agent n observes the current state s_t ∈ S_n and picks an action a_t ∈ A_n, and then the current packet is delivered to the corresponding neighbor of node n.

Reward. We craft the reward to guide the agent towards an effective policy for our target: minimizing the average delivery time.
The reward at time step t is set to be the sum of queueing time and transmission time: r_t = q + l, where the former, q, is the time spent in the queue of agent n, and the latter, l, is the transmission latency to the next hop.

C. Deep Neural Network

We introduce two types of algorithms for applying the deep neural network to Q-routing in this section. Essentially, they both replace the original Q-table in Q-routing with a neural network but utilize different information as the input. Note that in the reinforcement learning formulation, each node is an individual agent and therefore possesses its own neural network for decision making. Accordingly, the following description of the neural network architecture is tailored for a single agent.

Fig. 2. Fully connected neural network with ReLU activation (inputs: action history k·N, current destination N, future destinations m·N; two hidden layers of 32 neurons each; output layer of |A_n| neurons).

Class 1: Deep Q-routing (DQR)

The primary aim is to find whether there is any improvement if the Q-table in Q-routing, which stores Q-values as a guideline to choose actions, is replaced simply by a neural network without changing the input information. We propose an algorithm called Deep Q-routing (DQR) to compare the different representations of the routing policy.

As shown in the dotted box of Fig. 2, we build a fully connected neural network with two hidden layers of 32 neurons each. The input of the neural network is the one-hot encoding of the current packet's destination ID, so that the number of input neurons equals the total number of nodes in the network topology. For example, for the network with five nodes in Fig. 1, the one-hot encoding result of destination number 4 is [00010]. Furthermore, the size of the output layer and of the agent's action space |A_n| are identical, and the value of each output neuron is the estimated Q-value of the corresponding action.
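The DQR network just described (one-hot destination in, two 32-neuron ReLU hidden layers, one Q-value per neighbor out) can be sketched as a plain forward pass. This is an illustrative sketch only, not the authors' code; the weight initialization is ours, and the paper trains the weights with RMSProp rather than leaving them random.

```python
import numpy as np

# Sketch of the DQR network: one-hot destination -> 32 -> 32 -> |A_n| Q-values.
# Forward pass only; training (RMSProp on the DQN loss) is not shown here.

def one_hot(dest_id, n_nodes):
    """One-hot encode a destination ID over the n_nodes network nodes."""
    x = np.zeros(n_nodes)
    x[dest_id] = 1.0
    return x

def init_params(n_nodes, n_actions, hidden=32, seed=0):
    """Random small weights for the layer sizes described in the text."""
    rng = np.random.default_rng(seed)
    sizes = [n_nodes, hidden, hidden, n_actions]
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def q_values(params, x):
    """Fully connected forward pass with ReLU on the hidden layers."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:      # ReLU on hidden layers, linear output
            x = np.maximum(x, 0.0)
    return x                          # one estimated Q-value per neighbor

# e.g. the 5-node topology of Fig. 1, at a router with two neighbors:
params = init_params(n_nodes=5, n_actions=2)
q = q_values(params, one_hot(3, 5))  # Q-estimates for a packet bound for node 3
```

For DQR-EI the only change is the input vector: the one-hot destination is concatenated with the one-hot action history (k·N entries) and future destinations (m·N entries), so `n_nodes` in the sketch would grow to (1 + k + m)·N while the output layer stays |A_n|.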
With this change of the representation of the Q-value, we try to update the parameters of the neural network instead of the values of the Q-table.

Class 2: Deep Q-routing with extra information (DQR-EI)

While both DQR and Q-routing make a constant decision for packets with the same destination due to the single input, we propose another algorithm called Deep Q-routing with Extra Information (DQR-EI), which integrates more system information into each routing decision.

The input information of the neural network can be classified into three parts: (1) current destination: the destination node of the current packet, which is the same as above; (2) action history: the executed actions for the past k packets sent out just before the current packet; (3) future destinations: the destination nodes of the next m packets waiting behind the current packet. Before being input into the neural network, all of this information is processed with one-hot encoding. As a result of the additional input information, there are some changes in the structure of the neural network. As shown in Fig. 2, the number of neurons in the input layer and the first hidden layer is increased to hold the other two kinds of information, while the second hidden layer and the output layer remain unchanged. With the powerful expression capability of neural networks, the agent of DQR-EI is able to execute an adaptive routing policy as the network environment changes.

In both classes of neural network, we use the Rectified Linear Unit (ReLU) as the activation function and Root Mean Square Prop (RMSProp) as the optimization algorithm.

D. Learning Algorithm

By integrating Q-routing and DQN, we propose a packet routing algorithm with multi-agent deep reinforcement learning, where both the training and execution processes are decentralized. The pseudo-code of the learning algorithm is shown in Algorithm 1, in which the initialization and the training process are identical for each node.

Algorithm 1 Deep Q-routing (with extra information)

// initialization
for agent i = 1, N do
    Initialize replay buffer D_i ← ∅
    Initialize Q-network Q_i with random weights θ_i
end for
// training process
for episode = 1, M do
    for each decision epoch t do
        Assign current agent n and packet p
        Observe current state s_t
        Select and execute action
            a_t = a random action              with probability ε
                  argmax_a Q_n(s_t, a; θ_n)    with probability 1 − ε
        Forward p to next agent v_t
        Observe reward r_t and next state s_{t+1}
        Set transmission flag f_t = 1 if v_t = d_p, 0 otherwise
        Store transition (s_t, r_t, v_t, s_{t+1}, f_t) in D_n
        Sample random batch (s_j, r_j, v_j, s_{j+1}, f_j) from D_n
        Set y_j = r_j + max_a′ Q_{v_j}(s_{j+1}, a′; θ_{v_j})(1 − f_j)
        θ_n ← GradientDescent((y_j − Q_n(s_j, a_j; θ_n))^2)
    end for
end for

Every node i is treated as an individual agent and possesses its own neural network Q_i with particular parameters θ_i to
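The decision-epoch logic of Algorithm 1 can be condensed into the following runnable sketch. It is a simplification under our own assumptions: `q_net` and `env_step` are stand-ins supplied by the caller (the paper uses the DNN of Fig. 2 and a network simulator), and the RMSProp gradient step itself is omitted, returning only the targets y_j that would feed it.

```python
import random
from collections import deque

# One decision epoch of Algorithm 1 for agent n (illustrative sketch).
# q_net(agent, state) -> list of Q-values, one per neighbor.
# env_step(n, v, state) -> (reward, next_state, delivered) after forwarding to v.

def decision_epoch(n, state, neighbors, q_net, env_step, buffers,
                   epsilon=0.1, batch_size=4):
    # epsilon-greedy selection over the neighbor set (the action space A_n)
    if random.random() < epsilon:
        a = random.randrange(len(neighbors))
    else:
        q = q_net(n, state)
        a = max(range(len(neighbors)), key=lambda i: q[i])
    v = neighbors[a]                           # next-hop agent v_t

    # forward the packet, observe reward r_t and next state s_{t+1}
    r, next_state, delivered = env_step(n, v, state)
    f = 1.0 if delivered else 0.0              # transmission flag f_t
    buffers[n].append((state, a, r, v, next_state, f))

    # sample a batch and form targets y_j = r_j + (1 - f_j) * max_a Q_{v_j}(s_{j+1}, a),
    # evaluated with the *neighbor's* network, as in Algorithm 1
    sample = random.sample(list(buffers[n]), min(batch_size, len(buffers[n])))
    targets = [r_j + (1.0 - f_j) * max(q_net(v_j, s1))
               for (_, _, r_j, v_j, s1, f_j) in sample]
    return v, targets                          # targets feed the squared-error update
```

Note the fully-distributed aspect: the target for agent n's update queries only the neighbor v_j's network output, which the neighbor can report as scalar feedback, so no parameters are shared between routers.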