Indirect Reinforcement Learning for Autonomous Power Configuration and Control in Wireless Networks

Adrian Udenze and Klaus McDonald-Maier
School of Computer Science and Electronic Engineering, Embedded and Intelligent Systems Research Group, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, UK
audenz@essex.ac.uk, kdm@essex.ac.uk
Fax: +44 (0) 1206 87 2788

Abstract

In this paper, non deterministic Indirect Reinforcement Learning (RL) techniques for controlling the transmission times and power of Wireless Network nodes are presented. Indirect RL facilitates planning and learning which ultimately leads to convergence on optimal actions with reduced episodes or time steps compared to direct RL. Three Dyna architecture based algorithms for non deterministic environments are presented. The results show improvements over direct RL and conventional static power control techniques.

1. Introduction

Wireless communication is known to be unpredictable. Several factors affect the quality of the signal at the receiver including: the part of the frequency spectrum being utilized, the environment in which communication is taking place, the modulation schemes being implemented, the placement of nodes in the environment as well as the types of communicating devices being deployed [1]. These factors manifest themselves in phenomena like fading, propagation loss, scattering, reflection and diffraction on the transmitted signal [2]. The power at which a signal is transmitted has a direct influence on the extent of these manifestations. Transmit at a power too low and the signal attenuates too much to be successfully received at a receiver, transmit at a power too high and the transmitted signal interferes with other signals in the vicinity of the transmission. Unnecessarily high transmitter power levels also have the added detrimental effect of depleting power resources available to a Wireless Network node. For small battery
powered Wireless Sensor Network (WSN) nodes this can ultimately lead to untimely network failure.

Furthermore, the transmitter power problem also has a bearing on other aspects of the network protocol. It affects the levels of contention at the MAC layer. The power at which signals are transmitted changes the range of the transmitted signal, which in turn changes the number of neighbours each node has, which in turn changes the number of contending nodes for media access. In a multihop network, the transmitter power also affects the number of hops to the receiver, which in turn affects the levels of traffic passing through each node. The routes taken through a network are also affected by the transmitter power. Transmitting at high power reduces the delay incurred by going through multiple hops, but the energy consumed rises steeply with distance, a relationship given by

$$P_r = c \, P_t / d^{\alpha}$$

where $P_r$ is the power at the receiver, $c$ is a constant, $P_t$ is transmit power, $d$ is distance and $\alpha \geq 2$ [3]. To this end significant research has gone into finding optimal levels of transmission power to ensure maximal longevity of the network balanced with acceptable levels of network performance.

The contributions of this paper are to further the work done in [18] (a direct RL approach to Power Control) by presenting an indirect RL solution to the power control problem in wireless networks. The authors believe this indirect approach will reduce the number of episodes required to converge on optimal results compared to direct RL. The authors also present three indirect RL algorithms suitable for use in non deterministic environments. The work presented, while specifically targeting WSN, is also relevant to other types of Wireless Networks.

Following is a brief literature review. In section 3 the authors present 3 Dyna algorithms for use in non deterministic environments. In section 4 the power problem is reviewed followed by
simulation results in section 5 before concluding in section 6.

2009 NASA/ESA Conference on Adaptive Hardware and Systems. 978-0-7695-3714-6/09 $25.00 © 2009 Crown Copyright. DOI 10.1109/AHS.2009.51

2. Literature review

Research into power control can be loosely classified into four groups. The first group attempts to find optimal transmit power as a means of controlling the connectivity properties of a network. The work done in [4] works by iteratively merging node connections until there is one fully connected network. In [5], Wattenhofer et al. demonstrate a distributed algorithm where each node makes local decisions and grows its transmitter power until a neighbour is found in every direction. Proposal [6] works in a similar way by again controlling a node's degree. Compow [7] tackles the problem at the network layer and works by running multiple network protocols at each power level until a minimum power level that ensures network connectivity is found. For non homogeneous networks however, a single node could force all nodes to transmit at a higher power to ensure network connectivity. This drawback is addressed in [3] and [8] by tailoring the minimum transmit power for individual node pairs.

The second group of transmitter power control techniques employs power as a metric for finding optimal transmit power by way of finding optimal routes. The work done in [9] is one such technique where energy consumed per packet (amongst others) is suggested as a metric. In [10] statistics from the link and physical layers are used to calculate the cost of links.

The third group of techniques comprises techniques that work in the MAC layer. In [11] information shared between neighbours is used to control transmitter power by calculating the quality of service in the network. Techniques in [12] and [13] modify the IEEE 802.11 protocol by choosing levels that ensure minimum interference with ongoing transmissions based on estimates from a busy signal.
The fourth group of techniques for controlling power contrasts with the above techniques in that it treats the transmitter power problem as a non deterministic problem. Techniques that do not take into account the non deterministic nature of transmissions in Wireless Networks will be prone to wasting power on retransmitting unsuccessfully delivered packets [1]. The noise levels of the communication channel are known to fluctuate over time. A power level that results in a successful packet delivery at one point in time will not necessarily guarantee successful packet delivery later, due to these fluctuations. The work done in [1] studies packet delivery performance in a WSN. Several aspects of the author's findings are of particular interest. Losses due to multipath cancellations and path loss dominate low power transmissions, and successful packet transmissions varied considerably for nodes in the same vicinity. Other noisy phenomena such as scattering, fading, reflection and diffraction all contribute to the non determinism of WSN transmissions.

From a theoretical standpoint, significant work has been done on modeling the wireless communication medium. In [2] the signal at the receiver is given as:

recSignalStrength = SendPower − DOI loss + fading

where the DOI (degree of irregularity) loss = pathloss × $K_i$ and

$$PL(d)\,[\mathrm{dB}] = PL(d_0)\,[\mathrm{dB}] + 10 \times n \times \log_{10}\!\left(\frac{d}{d_0}\right) + X_{\sigma} \tag{1}$$

$PL(d)$ is the path loss at distance $d$, $PL(d_0)$ is the path loss at the reference distance, $n$ is the path loss exponent, $X_{\sigma}$ is a random variable, and $K_i$ is a random irregularity factor. In [14], the signal to noise ratio in a broadcast channel is modelled as

$$\gamma = \frac{G P}{\nu}$$

per sub channel, where $P$ is the average transmit power, $G$ is the received signal strength and $\nu$ is a Gaussian noise density. Clearly, increasing the transmitter power increases the quality of signal at the receiver; however, there is a random component in the models which can intermittently degrade the quality
of signal at the receiver. Quantitative work in [15] also supports the above statement. Zhou et al. show that even nodes out of communication range of a transmitting node can interfere with the transmission, and also that this interference is unpredictable, i.e. nodes within communicating distance do not always interfere.

The work done in [14] falls within the fourth group of techniques. By treating the communication channel as susceptible to random noise, transmitter power is adjusted to vary with fluctuations in the received signal strength. In [16] the authors present the transmitter power problem as a non deterministic optimisation problem. The technique presented models the problem as a Markov Decision Process (MDP) in which transmitter power is optimised subject to delay constraints. The communication channel is treated as a stochastic environment in which the channel states transition from good to bad in a random fashion. The work demonstrates power savings over fixed static power settings in the same environment. In [17] the formulation is improved and modelled as a partially observable MDP in which the state of the communication channel is partially observable. Both techniques above are model based: time is spent observing and acquiring models of channel transitions, based on which optimal solutions are found. The drawback of these model based techniques is addressed in [18]. In this case no model is assumed and an optimised power setting is attained merely by interactions with the environment, thus making the technique completely autonomous in addition to the other inherent properties of this technique, namely: dynamic, scalable and proactive.

3. Model based reinforcement learning

RL as described in [19] and [20] is fundamentally learning how to map situations to actions so as to maximise a numeric reward. An overview of RL is presented in [18]; for brevity, only the description necessary for the work done here follows.
The goal of an RL agent is to learn actions that maximise a numerical reward given by

$$R_t = \sum_{k=0}^{T} \gamma^{k} r_{t+k+1} \tag{2}$$

where $\gamma$ is a discount parameter, $0 \le \gamma \le 1$, and $T$ is the final time step; either $T = \infty$ or $\gamma = 1$ is allowed.

Assumed in the RL problem formulation is the Markov property, where the state of the environment at time $t+1$ is a function only of the state and action at the current time $t$:

$$\Pr\{\, s_{t+1} = s',\; r_{t+1} = r \mid s_t, a_t \,\} \tag{3}$$

for all $s'$, $r$, $s_t$ and $a_t$. The solution of an RL problem equates to evaluating the value of actions in any given state with a view to reinforcing actions with the highest values. Formally, given a policy $\pi$, a set of states $s \in S$ and an action set $a \in A(s)$, $\pi(s,a)$ is the probability of taking action $a$ in state $s$ under policy $\pi$. The value of $s$ under policy $\pi$ is

$$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s \Big\} \tag{4}$$

where $E_{\pi}\{\cdot\}$ is the expected value under policy $\pi$. The solution for $V^{\pi}(s)$ exploits the recursive relationship between state $s$ and the following states $s'$ and is given by the Bellman equation

$$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right] \tag{5}$$

A similar definition for the action value function $Q^{\pi}(s,a)$ explicitly gives the value of taking action $a$ in state $s$ under policy $\pi$ as

$$Q^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \} = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s, a_t = a \Big\} \tag{6}$$

A policy $\pi$ is at least as good as a policy $\pi'$ if its value function $V^{\pi}$ is greater than or equal to $V^{\pi'}$ for all states. The optimal value function $V^{*}$ is the value function of a policy that is at least as good as every other policy:

$$V^{*}(s) = \max_{\pi} V^{\pi}(s) = \max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{*}(s') \right] \tag{7}$$

for all $s \in S$. Similarly,

$$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s',a') \right] \tag{8}$$

for all $s \in S$ and $a \in A(s)$. Where a model of the environment does not exist, i.e. the parameter $P^{a}_{ss'}$ is not available, $V^{*}$ and $Q^{*}$ are obtained by estimating these values from experience.

The most general RL technique is TD(λ), which bridges MC and TD(0) [18] methods.
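When a model is available, equation (7) can be solved by repeating its backup until the values stop changing (value iteration). The Python sketch below is a minimal illustration under assumed inputs: the two-state channel `EXAMPLE`, its probabilities and rewards, and the function names are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple

# P[s][a] is a list of (probability, next_state, reward) triples.
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]


def backup(outcomes, V, gamma):
    """Expected one-step return: sum over s' of P * (R + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)


def value_iteration(P: Transitions, gamma: float = 0.9,
                    theta: float = 1e-8) -> Dict[str, float]:
    """Apply the backup of equation (7) until the largest change is below theta."""
    states = set(P)
    for actions in P.values():
        for outcomes in actions.values():
            states.update(s2 for _, s2, _ in outcomes)
    V = {s: 0.0 for s in states}          # terminal states keep value 0
    while True:
        delta = 0.0
        for s, actions in P.items():
            best = max(backup(o, V, gamma) for o in actions.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V


def greedy_policy(P: Transitions, V: Dict[str, float],
                  gamma: float = 0.9) -> Dict[str, str]:
    """Pick, in every state, the action with the largest backed-up value."""
    return {s: max(actions, key=lambda a: backup(actions[a], V, gamma))
            for s, actions in P.items()}


# Hypothetical channel: sending costs energy, waiting costs a little delay.
EXAMPLE: Transitions = {
    "good": {"send": [(0.9, "done", 1.0), (0.1, "bad", -0.2)],
             "wait": [(1.0, "good", -0.1)]},
    "bad":  {"send": [(0.3, "done", 1.0), (0.7, "bad", -0.2)],
             "wait": [(0.5, "good", -0.1), (0.5, "bad", -0.1)]},
}
```

With these assumed numbers, `greedy_policy(EXAMPLE, value_iteration(EXAMPLE))` sends in the good channel state and waits in the bad one, which mirrors the trade-off between energy and delay discussed in this paper.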
In TD(λ), backups are weighted by $\lambda$, where $0 \le \lambda \le 1$:

$$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)} \tag{9}$$

$R_t$ is the expected return. Learning updates are carried out using the rule:

NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]

The step size is the learning rate ($\alpha$) and the Target is the newly acquired value ($R_t$). In this work each non greedy action is selected with probability $\epsilon / |A(s)|$ and the greedy action with probability $1 - \epsilon + \epsilon / |A(s)|$.

In contrast with the work done in the first part of this research [19], where direct RL is employed, i.e. policy evaluations are carried out as a result of real experience from the environment, this work employs both direct and indirect RL. This type of RL is referred to as Dyna learning in [19]. A diagram of the architecture is reproduced from [19] below.

Figure 1: Dyna architecture. (Diagram elements: Policy/Value, Environment, Model, real experience, simulated experience, direct RL, model, planning, search control.)

The aim of this work fundamentally is to investigate techniques that may offer an improvement on the performance of the direct RL employed previously, i.e. reduce the time taken to converge on optimal actions. Three Dyna algorithms are experimented with and are given below. The DynaQ algorithm given in [19] is for a deterministic environment. The authors have therefore modified the DynaQ algorithm for use in non deterministic environments; the result is given in figure 2. To minimise time spent planning, the authors have opted to carry out the planning process at the end of each episode rather than at every time step. TD(λ) was confirmed in the first part of this two part series to perform marginally better than Q learning [19]. The authors have therefore also modified the TD(λ) algorithm to be used for planning in a non deterministic environment; the result, termed DynaTD, is given in figure 5. Finally, the algorithm in figure 3 is a modified version of the prioritised sweeping algorithm presented in [19]. The modifications carried out make the algorithm suitable for non deterministic environments.

    Initialise $Q(s,a)$, $N(s,a,s')$ and count for all $s \in S$ and $a \in A(s)$
    Do forever:
      (a) $s$ ← current (non-terminal) state
      (b) $a$ ← ε-greedy($s$, $Q$)
      (c) Execute action $a$; observe resultant state, $s'$, and reward, $r$
      (d) $Q(s,a) \leftarrow Q(s,a) + \alpha [ r + \gamma \max_{a'} Q(s',a') - Q(s,a) ]$
      (e) Update the model estimate $N(s,a,s')$ for the observed transition
      (f) If $s'$ is a terminal state:
            Repeat
              $\Delta \leftarrow 0$
              For each $s \in S$:
                $v \leftarrow V(s)$
                $V(s) \leftarrow \max_a \sum_{s'} N^{a}_{ss'} [ R^{a}_{ss'} + \gamma V(s') ]$
                $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
            Until $\Delta < \theta$ (a small positive number)
            $\pi(s) = \arg\max_a \sum_{s'} N^{a}_{ss'} [ R^{a}_{ss'} + \gamma V(s') ]$

Figure 2: DynaQ algorithm for non deterministic environments.

    Initialize $Q(s,a)$, $N(s,a,s')$, count for all $s$, $a$ and PQueue to empty
    Do forever:
      (a) $s$ ← current (nonterminal) state
      (b) $a$ ← policy($s$, $Q$)
      (c) Execute action $a$; observe resultant state, $s'$, and reward, $r$
      (d) Update the model estimate $N(s,a,s')$ for the observed transition
      (e) $p \leftarrow | \sum_{s'} N^{a}_{ss'} [ R^{a}_{ss'} + \gamma \max_{a'} Q(s',a') ] - Q(s,a) |$
      (f) If $p > \theta$, then insert $s, a$ into PQueue with priority $p$
      (g) Repeat N times, while PQueue is not empty:
            $s, a$ ← first(PQueue)
            $Q(s,a) \leftarrow \sum_{s'} N^{a}_{ss'} [ R^{a}_{ss'} + \gamma \max_{a'} Q(s',a') ]$
            Repeat, for all $\bar{s}, \bar{a}$ predicted to lead to $s$:
              $\bar{r}$ ← predicted reward
              $p \leftarrow | \sum_{s} N^{\bar{a}}_{\bar{s}s} [ R^{\bar{a}}_{\bar{s}s} + \gamma \max_{a} Q(s,a) ] - Q(\bar{s},\bar{a}) |$
              If $p > \theta$ then insert $\bar{s}, \bar{a}$ into PQueue with priority $p$

Figure 3: Prioritised sweep algorithm for non deterministic environments.

    Initialise $Q(s,a)$, $e(s,a)$, $N(s,a,s')$ and count for all $s \in S$ and $a \in A(s)$
    Do forever:
      Repeat (for each episode):
        Initialize $s$, $a$
        Repeat (for each step of episode):
          Take action $a$, observe $r$, $s'$
          Update the model estimate $N(s,a,s')$ for the observed transition
          Choose $a'$ from $s'$ using policy derived from $Q$
          $\delta \leftarrow r + \gamma Q(s',a') - Q(s,a)$
          $e(s,a) \leftarrow e(s,a) + 1$
          For all $s, a$:
            $Q(s,a) \leftarrow Q(s,a) + \alpha \delta e(s,a)$
            $e(s,a) \leftarrow \gamma \lambda e(s,a)$
          $s \leftarrow s'$; $a \leftarrow a'$
        Until $s$ is terminal
        Repeat
          $\Delta \leftarrow 0$
          For each $s \in S$:
            $v \leftarrow V(s)$
            $V(s) \leftarrow \max_a \sum_{s'} N^{a}_{ss'} [ R^{a}_{ss'} + \gamma V(s') ]$
            $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
        Until $\Delta < \theta$ (a small positive number)
        $\pi(s) = \arg\max_a \sum_{s'} N^{a}_{ss'} [ R^{a}_{ss'} + \gamma V(s') ]$

Figure 5: DynaTD for non deterministic environments.

4. Transmitter Power Problem

The transmitter power problem is formulated in the first part of this paper [19]. For the purpose of brevity, the RL transmitter power problem is only briefly discussed here. The data set and problem formulation are kept the same as in the earlier work, so that valid comparisons can be made on the effectiveness, or lack thereof, of the RL techniques proposed here as improvements on the previous work.

Figure 4 below depicts a cross section of a WSN. While emphasis is placed on WSN, the same treatment is also applicable to ad hoc networks where nodes communicate data not just to data sinks as in WSN but also to each other. Nodes A, B, C, D, E and F are within communication range and therefore interference range of each other, depending on the transmitter power used. The noise levels in the communication channel due to the various phenomena described previously are quantized into 3 distinct levels, namely: low noise level (LNL), medium noise level (MNL) and high noise level (HNL). The probability of successful packet delivery therefore depends on the noise levels during transmissions. In the simplest case, a node with a pending packet for transmission will make every attempt for a successful transmission by using an appropriate power level. On the other hand, to save energy, a node may choose to delay transmission in a noisy environment or transmit at a low power (with limited chances of success) at the expense of longer transmission delays. Such a system is depicted in figure 6 below.

Figure 4: Section of WSN showing the transmissions of two nodes within range of a third node.

With a packet pending transmission, a node has a choice of 3 actions: wait, transmit at low power or transmit at high power. The outcome or following state depends on the choice of action taken in the present state. Three decision epochs are assumed, i.e. the time between packet arrivals is long enough to allow three attempts at transmission without causing excessive queue lengths. At the third attempt, a transmitting node must make every attempt at a successful transmission, else a large cost is incurred. This large cost reflects the added delays to queues and the cost of retransmissions. Thus, the system cycles as follows: a cycle begins in state LNL1, MNL1 or HNL1 with some probability. A successful delivery results in a transition to state success, else the system transitions to state LNL2, MNL2 or HNL2 with varying probabilities, where an action is again decided on which results in a transition. In state LNL3, MNL3 and HNL3 the transmitting node must make every attempt to transmit successfully, following which the system transitions to state success or fail and the cycle repeats. Table 1 below gives the probability distributions used for simulations. It is important to note that the formulation and data set are used purely to aid analysis and are by no means fixed. For example, the state of the communication channel may be quantized into more states, the number of decision epochs may
be more or less depending on network traffic, and the choice of actions in each state may not just be power levels but also routes. What this example does demonstrate is the principle of problem formulation and, as will be shown in the results section, even with this conservative data set there is a marked improvement on power gain versus delay when compared with static transmitter power settings.

Figure 6: State transition diagram for example given.

5. Proposed implementation for power control

Power control is intrinsically tied to routing in wireless networks. [4], [5], [6], [7], [8], [9], [10] all provide references to power control being used to decide optimum routes and network topologies. Geographic statistics are also used in [21] and [22] for deciding network routes. The work in [15] clearly demonstrates that geographic information is not enough to determine optimal routes (the so called interference connectivity assumption). Furthermore, the quality of the communication channel is transient [1]; therefore optimal routes at one instant may not be optimal at the next. The conclusion drawn, which is the driving force behind this research, is that route finding at best should be treated as a stochastic problem, as in the works presented in [18] and in this paper. This implies that the optimal decisions over routes are made over many repeated transmissions.

The authors' proposed technique involves nodes maintaining a routing table which is in effect a value function of choices of actions in various states, where states represent the state of the communication channel and actions represent a choice of power levels and routes. Using figure 4 above for illustration, the transmitting node has a choice of three routes via nodes D, E and F en route to S. Assuming a setup phase where all nodes are active (discounting sleeping nodes [23]), the transmitting node transmits at a choice of power levels such that nodes within hearing distance pick up the transmitted signal and forward to S.
=ach node enroute stamps the received packet with the route identit! and costs as in standard route findin" techniues %))' such that the best route to S   can be deciphered. S   upon receipt also sends an acknowled"ement back to    so that    knows the optimum route to S  . 45 occurs and eventuall! the optimum route and power levels are used. (he fundamental difference from other route findin" and  power level techniues bein" the emphasis on non determinism and therefore makin" choices over multiple repeated events. :. ;imulation results (he contributions of this paper are to present improved 45 al"orithms to aid earlier conver"ence -reduced number of time steps or episodes on the 45 techniues previousl! used to solve the transmitter  power problem in %&3'. Al"orithms used for simulations were described in section 2 above. (he first set of results fi"ure : compares the best results a"ainst number of episodes of four al"orithms namel! the three indirect 45 al"orithms described in section 2 above and the best performin" direct 45 results from %&3'. D!naC is "iven in fi"ure )$ D!na(D in fi"ure 8 and 6rVSwp in fi"ure 2. (D-N is "iven in %&3' and %&;'. Due to the stochastic nature of the task$ different actions in different states conver"e at different rates. (he best results show the  percenta"e ratio of the closest results to e*act values as a function of episodes. (he worst results presented in fi"ure 3 also compares all four al"orithms. While  both results show little difference in performance for all three model based -indirect 45 techniues$ all three show a marked improvement on the direct (D-N techniue. /n particular note that eventuall! the worst results conver"e on ver! close to e*act results which is not the case for (D-N. (he ne*t set of results$ fi"ure ;$ show the percenta"e ratio of correct actions decided on b! the al"orithms a"ainst number of episodes. (D-N performs better in the earl! sta"es but is then outperformed b! 
the model

(Figure 6 state labels: LNL1, MNL1, HNL1, LNL2, MNL2, HNL2, LNL3, MNL3, HNL3, Success, Fail.)
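As a rough illustration of how the three-epoch transmitter power problem and a Dyna style learner fit together, the Python sketch below combines one-step Q learning from real experience (direct RL) with a count based transition model and an expected value planning sweep at the end of each episode. It is not the authors' DynaQ listing: the success probabilities, energy costs, failure cost, the uniform redraw of the noise level between epochs and all function names are hypothetical stand-ins, because the probability table used in the paper is not reproduced here.

```python
import random
from collections import defaultdict

NOISE = ("L", "M", "H")              # low, medium, high noise level
ACTIONS = ("wait", "low", "high")    # wait, or transmit at low / high power
# Hypothetical numbers, chosen only for illustration.
P_SUCCESS = {("L", "low"): 0.9, ("L", "high"): 0.99,
             ("M", "low"): 0.5, ("M", "high"): 0.9,
             ("H", "low"): 0.1, ("H", "high"): 0.9}
ENERGY = {"wait": 0.0, "low": -1.0, "high": -3.0}
FAIL_COST = -20.0                    # large cost if the third epoch fails
TERMINAL = ("success", "fail")


def step(state, action, rng):
    """Simulated channel: returns (next_state, reward)."""
    noise, epoch = state
    reward = ENERGY[action]
    if action != "wait" and rng.random() < P_SUCCESS[(noise, action)]:
        return "success", reward
    if epoch == 3:
        return "fail", reward + FAIL_COST
    return (rng.choice(NOISE), epoch + 1), reward


def greedy(Q, state):
    """Action with the largest current estimate in this state."""
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def dyna_q(episodes=3000, alpha=0.1, gamma=1.0, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                          # Q(s, a); 0 for terminals
    counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][(s', r)]
    for _ in range(episodes):
        state = (rng.choice(NOISE), 1)
        while state not in TERMINAL:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy(Q, state)
            nxt, reward = step(state, action, rng)
            # direct RL: one-step Q learning update from real experience
            target = reward + gamma * max(Q[(nxt, b)] for b in ACTIONS)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            # model learning: count the observed (next state, reward) outcome
            counts[(state, action)][(nxt, reward)] += 1
            state = nxt
        # planning at the end of the episode: expected backup under the
        # empirical (count based) model, so stochastic outcomes are averaged
        for (s, a), outcomes in counts.items():
            total = sum(outcomes.values())
            Q[(s, a)] = sum(
                n / total * (r + gamma * max(Q[(s2, b)] for b in ACTIONS))
                for (s2, r), n in outcomes.items())
    return Q
```

Under these assumed numbers the learned greedy action in the third epoch under high noise would be expected to be high power, since waiting there always incurs the failure cost. Averaging over observed outcome frequencies, rather than replaying the single last outcome, is the design choice that lets the planning step cope with a non deterministic channel.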