Talk: Learning efficiency of outcome in games

By Eva Tardos
Repeated games
- player’s value/cost additive over periods, while playing
- players try to learn what is the best from past data
- what can we say about the outcome? how long do they have to stay to ensure OK social welfare?
Result: routing, limit for very small users
- Theorem:
  - In any network with continuous, non-decreasing cost functions and small users
  - cost of Nash with rates ri for all i <= cost of opt with rate 2ri for all i
- Nash equilibrium: stable solution where no player had incentive to deviate.
- Price of Anarchy = cost of worst Nash equilibrium / social optimum cost;
Examples of price of anarchy bounds
Price of anarchy in auctions
- First price is auction
- All pay auction…
- Other applications include:
  - public goods
  - fair sharing
  - Walrasian Mechanism
Repeated game that is slowly changing
- Dynamic population model
  - at each step t each player I is replaced with an arbitrary new player with probability p
  - in a population of N players, each step, Np players replaced in expectation
  - population changes all the time: need to adjust
  - players stay long enough…
Learning in repeated game
- what is learning?
- Does learning lead to finding Nash equilibrium?
- fictitious play = best respond to past history of other players goal: “pre-play” as a way to learn to play Nash
Find a better idea when the game is playing?
- Change of focus: outcome of learning in playing
Nash equilibrium of the one shot game?
- Nash equilibrium of the one-shot game: stable actions a with no regret for any alternate strategy x.
- cost_i(x, a_-i) >= cost_i(a)
Behavior is far from stable
no regret without stability: learning
- no regret: for any fixed action x (cost \in [0,1]):
  - sum_t(cost_i(a^t)) <= sum_t(cost_i(x, a_-i^t)) + error
  - error <= √T (if o(T) called no-regret)
Outcome of no-regret learning in a fixed game
- limit distribution sigma of play (action vectors a = (a1, a2,…,an))
No regret leanring as a behavior model:
- Pro:
  - no need for common prior or rationality assumption on opponents
  - behavioral assumption: if there is a consistently good strategy: please notice!
  - algorithm: many simple rules ensure regret approx.
  - Behavior model ….
Distribution of smallest rationalizable multiplicative regret
- strictly positive regret: learning phase maybe better than no-regret
Today (with d options):
- sum_t(cost_i(a^t)) <= sum_t(cost_i(x, a_-i^t)) + √Tlogd
- sum_t(cost_i(a^t)) <= (1 + epsilon)sum_t(cost_i(x, a_-i^t)) + log(d)/epsilon
Quality of learning outcome
Proof technique: Smoothness
Learning and price of anarchy
Learning in dynamic games
- Dynamic population model
  - at each step t each player I is replaced with an arbitrary new player with probability p
- how should they learn from data?
Need for adaptive learning
Adapting result to dynamic populations
- inequality we wish to have
Change in optimum solution
Use differential privacy -> stable solution

Leave a Reply Cancel reply