Bayesian Reinforcement Learning for Coalition Formation under Uncertainty

Georgios Chalkiadakis, Dept. of Computer Science, Univ. of Toronto, Toronto, ON, M5S 3G4, Canada, gehalk@cs.toronto.edu
Craig Boutilier, Dept. of Computer Science, Univ. of Toronto, Toronto, ON, M5S 3G4, Canada, cebly@cs.toronto.edu

Abstract

Research on coalition formation usually assumes the values of potential coalitions to be known with certainty. Furthermore, settings in which agents lack sufficient knowledge of the capabilities of potential partners are rarely, if ever, touched upon. We remove these often unrealistic assumptions and propose a model that utilizes Bayesian (multiagent) reinforcement learning in a way that enables coalition participants to reduce their uncertainty regarding coalitional values and the capabilities of others. In addition, we introduce the Bayesian core, a new stability concept for coalition formation under uncertainty. Preliminary experimental evidence demonstrates the effectiveness of our approach.

1. Introduction

Coalition formation, widely studied in game theory and economics [7], has attracted much attention in AI as a means of dynamically forming partnerships or teams of cooperating agents. Most models of coalition formation assume that the values of potential coalitions are known with certainty, implying that agents possess knowledge of the capabilities of their potential partners, or at least that this knowledge can be reached via communication (e.g., see [11, 12]). However, in many natural settings, rational agents must form coalitions and divide the generated value without knowing a priori what this value may be or how suitable their potential partners are for the task at hand. The case of an enterprise trying to choose subcontractors while unsure of their capabilities is only one such example. The creation and interaction of virtual organizations has been anticipated as a long-term impact of agent coalition technologies on e-commerce; this cannot possibly be achieved without dealing with the problem of uncertainty.

The presence of uncertainty poses interesting theoretical questions, such as the discovery of analogs of the traditional concepts of stability. Furthermore, it suggests opportunities for agents to learn about each others' abilities through repeated interaction, refining how coalitions are formed over time. As a consequence, realistic models of coalition formation must be able to deal with situations in which action uncertainty and uncertainty about the types of potential partners translate into uncertainty about the values of various coalitions.

To this end, we propose a new model of coalition formation in which agents must derive coalitional values by reasoning about the types of other agents and the uncertainty inherent in the actions a coalition may take. We propose a new stability concept, the Bayesian core (BC), suitable for this setting, and describe a dynamic coalition formation process that will converge to the BC if it exists. Furthermore, since one agent will generally learn something about the abilities of other agents via interaction with them in a coalition, we propose a reinforcement learning (RL) model in which agents refine their beliefs about others through repeated interaction. We propose a specific Bayesian RL model in which agents maintain explicit beliefs about the types of others, and choose actions and coalitions not only for their immediate value, but also for their value of information (i.e., what can be learned about other agents).
We show that the agents in this framework can reach coalition and payoff configurations that are stable given the agents' beliefs, while learning the types of their partners and the values of coalitions. We believe these ideas could be of value for, say, e-commerce applications where trust among potential partners (e.g., where each is uncertain about the others' capabilities) is an issue.

We begin in Section 2 with a discussion of relevant work on coalition formation, including recent work that deals with dynamic coalition formation and some forms of uncertainty. In Section 3 we propose a model for coalition formation in which agent abilities are not known with certainty, and actions have stochastic effects. We introduce the Bayesian core concept and a suitable dynamic formation process. We then describe a Bayesian RL model in Section 4 that allows agents to learn about their partners through their interactions in coalitions.

2. Background

Cooperative game theory deals with situations where players act together in a cooperative equilibrium selection process involving some form of bargaining, negotiation, or arbitration [7]. Coalition formation is one of the fundamental areas of study within cooperative game theory. We briefly review relevant work in this section.

2.1. Coalition Formation

Let $N = \{1, \ldots, n\}$, $n > 2$, be a set of players. A subset $S \subseteq N$ is called a coalition, and we assume that agents participating in a coalition may coordinate their activities for mutual benefit. A coalition structure is a partition of the set of agents containing exhaustive and disjoint coalitions. Coalition formation is the process by which individual agents form such coalitions, generally to solve a problem by coordinating their efforts. The coalition formation process can be seen as being composed of the following activities [1, 10]: (a) the search for an optimal coalition structure; (b) the solution of a joint problem facing the members of each coalition; and (c) the division of the value of the generated solution among the coalition members. (Throughout we assume transferable utility.)

While seemingly complex, coalition formation can be abstracted into a fairly simple model [7]. A characteristic function $\upsilon : 2^N \to \mathbb{R}$ defines the value $\upsilon(S)$ of each coalition $S$. Intuitively, $\upsilon(S)$ represents the maximal payoff the members of $S$ can jointly receive by cooperating effectively. An allocation is a vector of payoffs $x = (x_1, \ldots, x_n)$ assigning some payoff to each $i \in N$. An allocation is feasible w.r.t. coalition structure $CS$ if $\sum_{i \in S} x_i \le \upsilon(S)$ for each $S \in CS$, and is efficient if this holds with equality. When rational agents seek to maximize their individual payoffs, stability becomes critical. Research in coalition formation has developed several notions of stability, among the strongest being the core.

Defn 1 Let $CS$ be a coalition structure, and let $x \in \mathbb{R}^n$ be some allocation of payoffs to the agents. The core is the set of payoff configurations

$$C = \{(x, CS) \mid \forall S \subseteq N, \ \sum_{i \in S} x_i \ge \upsilon(S) \ \text{and} \ \sum_{i \in N} x_i = \sum_{S \in CS} \upsilon(S)\}$$

In a core allocation, no subgroup of players can guarantee all of its members a higher payoff. As such, no coalition would ever "block" the proposal for a core allocation. Unfortunately, the core might be empty, and, furthermore, it is exponentially hard to compute. (We do not deal with complexity issues related to coalition formation in this work; we note, however, that such issues have become the focus of recent research (e.g., [9]).)
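As a concrete illustration of Defn 1 (not part of the paper), the following Python sketch brute-forces the core conditions for a small transferable-utility game. The helper `subsets`, the function `in_core`, the tolerance handling, and the three-player game are all our own illustrative assumptions.

```python
from itertools import chain, combinations

def subsets(agents):
    """All non-empty subsets of the agent set."""
    return chain.from_iterable(
        combinations(agents, r) for r in range(1, len(agents) + 1))

def in_core(x, cs, v, eps=1e-9):
    """Check Defn 1: (x, cs) is in the core iff the allocation is
    efficient w.r.t. the coalition structure cs and no subset S of
    agents is allocated less in total than its value v(S)."""
    agents = sorted(a for coalition in cs for a in coalition)
    efficient = abs(sum(x[i] for i in agents) - sum(v[c] for c in cs)) < eps
    unblocked = all(sum(x[i] for i in s) >= v[frozenset(s)] - eps
                    for s in subsets(agents))
    return efficient and unblocked

# A tiny three-player example: v maps each coalition to its value.
v = {frozenset(s): val for s, val in [
    ((1,), 0), ((2,), 0), ((3,), 0),
    ((1, 2), 90), ((1, 3), 80), ((2, 3), 70), ((1, 2, 3), 120)]}
print(in_core({1: 50, 2: 40, 3: 30}, [frozenset({1, 2, 3})], v))  # True
```

Here (50, 40, 30) satisfies every pairwise constraint with equality, so the allocation lies on the boundary of the core; lowering any single payoff would let some pair block.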
There also exist other solution concepts, such as the Shapley value and the kernel [7].

In recent years, there has been extensive research covering many aspects of the coalition formation problem. None has yet dealt with dynamic coalition formation under the "extreme" uncertainty we tackle in this paper. However, various coalition formation processes and some types of uncertainty have been studied. We briefly review some of the work upon which we draw.

Dieckmann and Schwalbe [5] describe a dynamic process of coalition formation (under the usual deterministic coalition model). This process allows for exploration of suboptimal "coalition formation actions." At each stage, a randomly chosen player decides which of the existing coalitions to join, and demands a payoff. A player will join a coalition if and only if he believes it is in his best interest to do so. These decisions are determined by a "non-cooperative best-reply rule", given the coalition structure and allocation in the previous period: a player switches coalitions if his expected payoff in the new coalition exceeds his current payoff; and he demands the most he can get subject to feasibility. The players observe the present coalition structure and the demands of the other agents, and expect the current coalition structure and demands to prevail in the next period. The induced Markov process (when all players adopt the best-reply rule) converges to an absorbing state; and if players can explore with myopically suboptimal actions, all absorbing states are core allocations. Konishi and Ray [6] study a somewhat related coalition formation process.

Suijs et al. [14, 13] introduce stochastic cooperative games (SCGs), comprising a set of agents, a set of coalitional actions, and a function assigning to each action a random variable with finite expectation, representing the payoff to the coalition when this action is taken. These papers provide strong theoretical foundations for games with this restricted form of action uncertainty, and describe classes of games for which the core of an SCG is nonempty (though no coalition formation process is explicitly modeled).

Finally, Shehory and Kraus have proposed coalition formation mechanisms that take into account the capabilities of the various agents [11] and deal with expected payoff allocation [12]. However, information about the capabilities or resources of others is obtained via communication.

2.2. Bayesian Reinforcement Learning

Since we will adopt a Bayesian approach to learning about the abilities of other agents, we briefly review relevant prior work on Bayesian RL. Assume an agent is learning to control a stochastic environment modeled as a Markov decision process (MDP) $\langle S, A, R, \Pr \rangle$, with finite state and action sets $S$, $A$, reward function $R$, and dynamics $\Pr$. The dynamics $\Pr$ refers to a family of transition distributions $\Pr(s, a, \cdot)$, where $\Pr(s, a, s')$ is the probability with which state $s'$ is reached when action $a$ is taken at $s$. $R(s, r)$ denotes the probability with which reward $r$ is obtained when state $s$ is reached. The agent is charged with constructing an optimal Markovian policy $\pi : S \to A$ that maximizes the expected sum of future discounted rewards over an infinite horizon: $E_\pi[\sum_{t=0}^{\infty} \gamma^t R^t \mid S^0 = s]$. This policy, and its value, $V^*(s)$ at each $s \in S$, can be computed using standard algorithms such as policy or value iteration.

In the RL setting, the agent does not have direct access to $R$ and $\Pr$, so it must learn a policy based on its interactions with the environment.
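For completeness, here is a minimal value iteration sketch of the kind alluded to above; it is a standard textbook routine, not anything specific to the paper. We assume (our own simplification, matching the paper's reward model) that reward attaches to the state reached, summarized as an expected-reward vector, and that `P[a]` is the per-action transition matrix.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """P: array of shape (|A|, |S|, |S|), P[a][s][s'] = Pr(s, a, s').
    R: vector of length |S|, expected reward for reaching each state.
    Returns the optimal values V* and a greedy (optimal) policy."""
    V = np.zeros(P.shape[1])
    while True:
        # Q(a, s) = sum_s' Pr(s, a, s') * (R(s') + gamma * V(s'))
        Q = P @ (R + gamma * V)              # shape (|A|, |S|)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)   # V*, pi*
        V = V_new
```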
Any of a number of RL techniques can be used to learn an optimal policy.

In model-based RL methods, the learner maintains an estimated MDP $\langle S, A, \hat{R}, \hat{\Pr} \rangle$, based on the set of experiences $\langle s, a, r, t \rangle$ obtained so far. At each stage (or at suitable intervals) this MDP can be solved (exactly or approximately). Single-agent Bayesian methods [4] allow agents to incorporate priors and explore optimally, assuming some prior density $P$ over possible dynamics $D$ and reward distributions $R$, which is updated with each data point $\langle s, a, r, t \rangle$.

Similarly, multiagent Bayesian RL agents [3] update prior distributions over the space of possible models as well as the space of possible strategies being employed by other agents. The value of performing an action at a belief state involves two main components: an expected value with respect to the current belief state; and its impact on the current belief state. The first component is typical in RL, while the second captures the expected value of information (EVOI) of an action. Each action gives rise to some "response" by the environment that changes the agent's beliefs, and these changes can influence subsequent action choice and expected reward. EVOI need not be computed directly, but can be combined with "object-level" expected value through Bellman equations describing the solution to the POMDP that represents the exploration-exploitation problem by conversion to a belief state MDP.

3. A Bayesian Coalition Formation Model

In this section we introduce the problem of Bayesian coalition formation, define the Bayesian core, and describe a dynamic coalition formation process for this setting.

3.1. The Model

A Bayesian coalition formation problem is characterized by a set of agents, a set of types, a set of coalitional actions, a set of outcomes or states, a reward function, and agent beliefs over types. We describe each of these components in turn.

We assume a set of agents $N = \{1, \ldots, n\}$, and for each agent $i$ a finite set of possible types $T_i$. Each agent $i$ has a specific type $t \in T_i$, which intuitively captures $i$'s "abilities" (in a way that will become apparent when we describe actions). We let $T = \times_{i \in N} T_i$ denote the set of type profiles. For any coalition $C \subseteq N$, $T_C = \times_{i \in C} T_i$, and for any $i \in N$, $T_{-i} = \times_{j \ne i} T_j$. Each $i$ knows its own type $t_i$, but not those of other agents. Agent $i$'s beliefs $B_i$ comprise a joint distribution over $T_{-i}$, where $B_i(\vec{t}_{-i})$ is the probability $i$ assigns to other agents having type profile $\vec{t}_{-i}$. We use $B_i(\vec{t}_C)$ to denote the marginal of $B_i$ over any subset $C$ of agents, and for ease of notation, we let $B_i(t_i)$ refer to $i$'s "beliefs" about its own type (assigning probability 1 to its actual type and 0 to all others).

A coalition $C$ has available to it a finite set of coalitional actions $A_C$. When an action is taken, it results in some outcome or state $s \in S$. The odds with which an outcome is realized depend on the types of the coalition members (e.g., the outcome of building a house will depend on the abilities of the team members). We let $\Pr(s \mid \alpha, \vec{t}_C)$ denote the probability of outcome $s$ given that coalition $C$ takes action $\alpha \in A_C$ and member types are given by $\vec{t}_C \in T_C$.
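Although the paper defers the learning details to Section 4, the model above already fixes how a type posterior must look: after observing outcome $s$ of coalitional action $\alpha$, agent $i$ reweights each candidate type profile by its likelihood $\Pr(s \mid \alpha, \vec{t}_C)$. A minimal Bayes-rule sketch, with hypothetical names of our own choosing:

```python
def update_beliefs(belief, outcome, action, outcome_prob):
    """One Bayesian update over partner type profiles.
    belief: dict mapping type profile t_C -> prior probability B_i(t_C)
    outcome_prob: callable giving Pr(outcome | action, profile)
    Returns the normalized posterior over type profiles."""
    posterior = {tc: p * outcome_prob(outcome, action, tc)
                 for tc, p in belief.items()}
    z = sum(posterior.values())
    if z == 0:
        return dict(belief)  # outcome impossible under all profiles
    return {tc: p / z for tc, p in posterior.items()}
```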
Finally, we assume that each state $s$ results in some reward $R(s)$. If $s$ results from a coalitional action, the members are assigned $R(s)$, which is assumed to be divisible/transferable among the members.

The value of coalition $C$ with members of type $\vec{t}_C$ is:

$$V(C \mid \vec{t}_C) = \max_{\alpha \in A_C} \sum_s \Pr(s \mid \alpha, \vec{t}_C) R(s) = \max_{\alpha \in A_C} Q(C, \alpha \mid \vec{t}_C)$$

Unfortunately, this coalition value cannot be used in the coalition formation process if the agents are uncertain about the types of their potential partners. However, each $i$ has beliefs about the value of any coalition based on its expectation of this value w.r.t. other agents' types:

$$V_i(C) = \max_{\alpha \in A_C} \sum_{\vec{t}_C \in T_C} B_i(\vec{t}_C) Q_i(C, \alpha \mid \vec{t}_C) = \max_{\alpha \in A_C} Q_i(C, \alpha)$$

Note that $V_i(C)$ is not simply the expectation of $V(C)$ w.r.t. $i$'s beliefs about types. The expectation $Q_i$ of action values (i.e., $Q$-values) cannot be moved outside the max operator: a single action must be chosen which is useful given $i$'s uncertainty. Of course, $i$'s estimate of the value of a coalition, or any coalitional action, may not conform with those of other agents. This leads to additional complexity when defining suitable stability concepts. We turn to this issue in the next section. However, $i$ is certain of its reservation value, the amount it can attain by acting alone:

$$rv_i = V_i(\{i\}) = \max_{\alpha \in A_{\{i\}}} \sum_s \Pr(s \mid \alpha, t_i) R(s)$$

3.2. The Bayesian Core

We define an analog of the traditional core concept for the Bayesian coalition formation scenario. The notion of stability is made somewhat more difficult by the uncertainty associated with actions: since the payoffs associated with coalitional actions are stochastic, allocations must reflect this [14, 13]. Stability is rendered much more complex still by the fact that different agents have potentially different beliefs about the types of other agents.

Because of the stochastic nature of payoffs, we assume that players join a coalition with certain relative payoff demands [14]. Let $\vec{d}$ represent the payoff demand vector $\langle d_1, \ldots, d_n \rangle$, and $\vec{d}_C$ the demands of those players in coalition $C$. For any agent $i \in C$ we define the relative demand of agent $i$ to be $r_i = d_i / \sum_{j \in C} d_j$. If reward $R$ is received by coalition $C$ as a result of its choice of action, each $i$ receives payoff $r_i R$. This means that the excesses or losses deriving from the fact that the reward function is stochastic are expected to be allocated to the agents in proportion to their agreed-upon demands. As such, each agent has beliefs about any other agent's expected payoff given a coalition structure and demand vector. Specifically, agent $i$'s beliefs about the expected stochastic payoff of some agent $j \in C$ are denoted $\bar{p}_{ij} = r_j V_i(C)$. If $i \in C$, $i$ believes its own expected payoff to be $\bar{p}_{ii} = r_i V_i(C)$.

A difficulty with using $V_i(C)$ in the above definition of expected stochastic payoff is that it assumes that all coalition members agree with $i$'s assessment of the best (expected reward-maximizing) action for $C$. Instead, we suppose that coalitions are formed using a process by which some coalitional action $\alpha$ is agreed upon, much like demands. In this case, $i$'s beliefs about $j$'s expected payoff are $\bar{p}_{ij}(\alpha, C) = r_j Q_i(C, \alpha)$.
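The key point above is that the belief expectation sits inside the max: the agent must commit to one action that is good on average across type profiles. The ordering is easy to get wrong, so a short sketch may help (our own hypothetical names; `q(alpha, tc)` stands for $Q_i(C, \alpha \mid \vec{t}_C) = \sum_s \Pr(s \mid \alpha, \vec{t}_C) R(s)$, assumed given):

```python
def coalition_value(actions, belief, q):
    """V_i(C) = max_alpha sum_{t_C} B_i(t_C) * Q_i(C, alpha | t_C).
    actions: the coalitional actions A_C
    belief: dict mapping type profile t_C -> B_i(t_C)
    q: callable (alpha, t_C) -> Q_i(C, alpha | t_C)"""
    return max(sum(p * q(alpha, tc) for tc, p in belief.items())
               for alpha in actions)
```

The reservation value $rv_i$ is then just `coalition_value` applied to the singleton coalition $\{i\}$, whose only type profile is $i$'s own (known) type.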
Finally, we let $\bar{p}_{ij,C}(\alpha, \vec{d}_C)$ denote $i$'s beliefs about $j$'s expected payoff if it were a member of any $C \subseteq N$ with demand $\vec{d}_C$ taking action $\alpha$:

$$\bar{p}_{ij,C}(\alpha, \vec{d}_C) = \frac{d_j Q_i(C, \alpha)}{\sum_{k \in C} d_k}$$

Intuitively, if a coalition structure and payoff allocation are stable, we would expect: (a) no agent believes it will receive a payoff (in expectation) that is less than its reservation value; and (b) based on its beliefs, no agent will have an incentive to suggest that the coalition structure (or its allocation or action choice) be changed; specifically, there is no alternative coalition it could reasonably expect to join that offers it a better payoff than it expects to receive given the action choice and allocation agreed upon by the coalition to which it belongs.

Thus we define the Bayesian core (BC) as follows:

Defn 2 Let $\langle CS, \vec{d} \rangle$ be a coalition structure, demand vector pair, with $C_i$ denoting the $C \in CS$ of which $i$ is a member. Then $\langle CS, \vec{d} \rangle$ is in the Bayesian core of a Bayesian coalition problem iff, for all $C \in CS$, there exists an $\alpha \in A_C$ such that, for no $S \subseteq N$ is there an action $\beta \in A_S$ and demand vector $\vec{d}_S$ s.t. $\bar{p}_{ii,S}(\beta, \vec{d}_S) > \bar{p}_{ii}(\alpha, C_i)$ for all $i \in S$.

In words, all agents in every $C \in CS$ believe that the coalition structure and payoff allocation currently in place ensure them expected payoffs that are as good as any they might realize in any other coalition. (Note that if some $S$, $\beta$, and $\vec{d}_S$ exist that make some $i \in S$ strictly better off while keeping all other $j \in S$ equally well off, then there must exist a $\vec{d}_S$ that makes all $j \in S$ strictly better off.) Furthermore, their beliefs "coincide" in the weak sense that there is some coalitional action $\alpha$ that they commonly believe to ensure this better payoff. This doesn't mean that $\alpha$ is what each believes to be best. But an agreement to do $\alpha$ is enough to keep each member of $C$ from defecting.

The core is a special case of the BC when all agents know the types of other agents (which is the only way their beliefs can coincide, since each agent knows its own type). In this case, all beliefs about coalitional values coincide, and the BC coincides with the core of the induced characteristic function game. Since the core is not always non-empty, it follows that the BC is not always non-empty.

3.3. Dynamic Coalition Formation

We now propose a protocol for dynamic coalition formation. The protocol is derived from the process defined in [5], with two main differences: it deals with expected, rather than certain, coalitional values; and it allows for the proposal of a coalitional action during formation.

The process proceeds in stages. At any point in time, we suppose there exists a structure $CS$, demand vector $\vec{d}$, and a set of agreed-upon coalitional actions $\vec{\alpha}_{CS}$ (with one $\alpha \in A_C$ for each $C \in CS$). (We might initially start with singleton coalitions with the obvious choices of actions.) With some probability $\gamma_i$, agent $i$, the proposer, is given the opportunity to propose a change to the current structure. We assume $\gamma_i > 0$ for each $i \in N$, and permit $i$ the following options: it can propose to stay in its current coalition, but propose a new demand $d_i$ and/or a new coalitional action; or it can propose to join any other existing coalition with a new demand $d_i$ and a suggested coalitional action. The second option includes the possibility that $i$ "breaks away" into a singleton. If $i$ proposes a change to the current structure/demand/action, then the new arrangement will occur only if all "affected" coalition members agree to the change. Otherwise, the current structure and agreements remain in force.

To reflect the rationality of the players, we impose restrictions on the possible proposal and acceptance decisions.
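Operationally, checking whether a candidate $(S, \beta, \vec{d}_S)$ blocks a configuration under Defn 2 amounts to one inequality per member of $S$, each evaluated under that member's own beliefs. A sketch over hypothetical data structures of our own:

```python
def blocks(demands_S, q_of, current_payoff):
    """True iff (S, beta, d_S) blocks per Defn 2: every i in S expects
    strictly more in S than under its current coalition's agreed action.
    demands_S: dict i -> d_i for the members of S
    q_of: dict i -> Q_i(S, beta), i's own estimate of the action's value
    current_payoff: dict i -> p_bar_{ii}(alpha, C_i)"""
    total = sum(demands_S.values())
    return all(demands_S[i] * q_of[i] / total > current_payoff[i]
               for i in demands_S)
```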
Specifically, we require the proposer to suggest a new demand that maximizes its payoff, while taking into consideration its beliefs about whether affected agents will accept this demand. Thus for any coalition it proposes to join (or new demand it makes of its own coalition), it will ask for the maximum demand that it believes affected members will find acceptable.

Let $\bar{p}_{ii,C}(\alpha, d_i) = \bar{p}_{ii,C}(\alpha, \vec{d}_C \circ d_i)$ denote $i$'s beliefs about its expected payoff should it join coalition $C \in CS$ with demand $d_i$ (or make a new demand of its own coalition), with $C \cup \{i\}$ taking action $\alpha$. When proposing to join $C$, $i$ should make the maximum demand ($d_i$ and $\alpha$) that is feasible according to its beliefs, in other words, that it believes the other agents will accept. More precisely, we say $\langle C, d_i, \alpha \rangle$ is feasible for $i$ if:

$$\forall j \in C, \quad \frac{d_j Q_i(C \cup \{i\}, \alpha)}{\sum_{k \in C \cup \{i\}, k \ne i} d_k + d_i} \ge \bar{p}_{ij}$$

If $\langle C, d_i, \alpha \rangle$ is feasible for $i$, then $i$ expects the members of $C$ to accept this demand. Of course, $i$ does not know this for sure, since it does not know what the members of $C$ believe, but has its own estimates of their current values $\bar{p}_{ij}$. (This poses the interesting question of how best to model one agent's beliefs about another's beliefs in this setting.) Agent $i$ can directly calculate its maximum rational demand w.r.t. $C$ and action $\alpha$:

$$d_i^{max}(C, \alpha) = \min_j \frac{d_j Q_i(C \cup \{i\}, \alpha) - \bar{p}_{ij} \sum_{k \in C \cup \{i\}, k \ne i} d_k}{\bar{p}_{ij}}$$

Assumption 1 Let $0 < \delta < 1$ be a sufficiently small smallest accounting unit. When any $i$ makes a demand $\langle C, d_i, \alpha \rangle$ to coalition $C$, its payoff demand $d_i$ is restricted to the finite set $D_i(C, \alpha)$ of all integral multiples of $\delta$ in the closed interval $[rv_i, d_i^{max}(C, \alpha)]$.

With this model in place, we can define two related coalition formation processes. In the best reply (BR) process, proposers are chosen randomly as described above, and any proposer $i$ is required to make its maximal feasible demand:

$$\max_C \max_{\alpha \in A_C} \max_{d_i \in D_i(C, \alpha)} \bar{p}_{ii,C}(\alpha, d_i)$$

If there are several maximal feasible demands, $i$ chooses among them with equal probability. As above, such a proposal is accepted only if all members of affected coalitions are no worse off in expectation (w.r.t. their own beliefs). This best reply process induces a discrete-time, finite-state Markov chain with states of the form $\langle CS^t, \vec{\alpha}^t, \vec{d}^t \rangle$. The state at time $t$ is sufficient to determine the probability of transitioning to any new state at time $t + 1$.

We also consider a slight modification of the best reply process, the best reply with experimentation (BRE) process.
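The maximum rational demand has a closed form (it is the feasibility inequality above, rearranged for $d_i$), and Assumption 1 then discretizes the resulting interval. A sketch with hypothetical names:

```python
import math

def max_rational_demand(demands, q_i, pbar):
    """d_i^max(C, alpha): the largest demand i believes every current
    member j of C will accept.
    demands: dict j -> d_j for the existing members of C
    q_i: i's estimate Q_i(C u {i}, alpha)
    pbar: dict j -> p_bar_{ij}, i's estimate of j's current payoff."""
    others = sum(demands.values())  # sum over k in C u {i}, k != i
    return min(demands[j] * q_i / pbar[j] - others for j in demands)

def demand_set(rv_i, d_max, delta=0.5):
    """D_i(C, alpha): integral multiples of delta in [rv_i, d_i^max]."""
    lo, hi = math.ceil(rv_i / delta), math.floor(d_max / delta)
    return [k * delta for k in range(lo, hi + 1)]
```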
It proceeds similarly to BR with the following exception: if proposer $i$ believes there is a coalition $S$ which would be beneficial to it, but which cannot be derived starting from the existing $CS$, it can propose an arbitrary feasible demand in order to destabilize the current state in hopes of reaching a better structure. More precisely, the best reply is chosen with probability $1 - \varepsilon$, and some other feasible demand with arbitrarily small $d_i \ge 0$ is chosen with probability $\varepsilon$ (each with equal probability). This can be viewed as a "trembling" mechanism or as explicit experimentation. Furthermore, any agent $j$ that is part of an affected coalition will choose to accept a demand from $i$ that lowers its payoff with probability $\varepsilon$ iff $j$ believes there exists some coalition $S$, with $i, j \in S$, such that all members of $S$ are better off than currently (i.e., $V_j(S) > \sum_{k \in S} \bar{p}_{jk}$).

The BRE process has some reasonable properties. First we note that the absorbing states of the process coincide with Bayesian core allocations.

Theorem 1 The set of demand vectors associated with an absorbing state of the BRE process coincides with the set of BC allocations. Specifically, $\omega = \langle CS, \vec{d}, \vec{\alpha}_{CS} \rangle$ is an absorbing state of the BRE process iff $\langle CS, \vec{d} \rangle \in$ BC and each $\alpha_C$, $C \in CS$, satisfies the stability requirement.

Proof sketch: Part (i): If a state $\omega$ is in the BC, no agent believes that he can gain either by switching coalitions or by changing his demand. Moreover, as no agent believes that a "blocking" coalition exists, no agent experiments.

Part (ii): Suppose that $\omega = \langle CS, \vec{d}, \vec{\alpha}_{CS} \rangle$ is a non-BC absorbing state of the BRE process. Since it is not in the BC, there exists some $i$ that believes there exist an $S$ and $\alpha'$ s.t. $\bar{p}_{ii,S}(\alpha') > \bar{p}_{ii,C}(\alpha^i_{CS})$. Consequently, with probability $\varepsilon$, at least $i$ will experiment, potentially asking for zero payoff. Thus, there exists a positive probability that $\omega$ will be left, which contradicts the statement that $\omega$ is absorbing.

Theorem 1 does not guarantee that a BC allocation will actually be reached by the BRE process. However, we can prove the following theorem:

Theorem 2 If the BC is non-empty, the BRE process will converge to an absorbing state with probability one.

Proof sketch: The proof is completely analogous to the proof for the deterministic coalition formation model [5]. The basic idea is that when the BC is not empty, all ergodic sets reached by the BRE process are singletons, therefore the BRE process will converge to an absorbing state.

Theorems 1 and 2 together ensure that if the BC is not empty then the BRE process will eventually reach a BC allocation, no matter what the initial coalition structure.

To test the validity of this approach empirically, we examined the BRE process on several simple Bayesian coalition formation problems.
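Returning to the proposer's choice rule in the BRE process described above, it can be sketched as follows. This is our own simplification: we assume the caller has already enumerated the proposer's maximal feasible demands and a pool of other feasible options with arbitrarily small $d_i \ge 0$.

```python
import random

def bre_propose(best_replies, experimental, epsilon=0.05):
    """One BRE proposer step. With probability 1 - epsilon, pick a
    best reply (ties broken uniformly); with probability epsilon,
    experiment with some other feasible demand, which may destabilize
    the current state in hopes of reaching a better structure."""
    if experimental and random.random() < epsilon:
        return random.choice(experimental)
    return random.choice(best_replies)
```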