Decentralized Q-Learning in Zero-sum Markov Games
- URL: http://arxiv.org/abs/2106.02748v1
- Date: Fri, 4 Jun 2021 22:42:56 GMT
- Title: Decentralized Q-Learning in Zero-sum Markov Games
- Authors: Muhammed O. Sayin, Kaiqing Zhang, David S. Leslie, Tamer Basar, Asuman Ozdaglar
- Abstract summary: We study multi-agent reinforcement learning (MARL) in discounted zero-sum Markov games.
We develop for the first time a radically uncoupled Q-learning dynamics that is both rational and convergent.
The key challenge in this decentralized setting is the non-stationarity of the learning environment from an agent's perspective.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study multi-agent reinforcement learning (MARL) in infinite-horizon
discounted zero-sum Markov games. We focus on the practical but challenging
setting of decentralized MARL, where agents make decisions without coordination
by a centralized controller, based only on their own payoffs and the local
actions they execute. The agents need not observe the opponent's actions or
payoffs, may even be oblivious to the presence of the opponent, and need not be
aware of the zero-sum structure of the underlying game; this setting is also
referred to as radically uncoupled in the literature on learning in games. In
this paper, we develop for the first time a radically uncoupled Q-learning dynamics
that is both rational and convergent: the learning dynamics converges to the
best response to the opponent's strategy when the opponent follows an
asymptotically stationary strategy; the value function estimates converge to
the payoffs at a Nash equilibrium when both agents adopt the dynamics. The key
challenge in this decentralized setting is the non-stationarity of the learning
environment from an agent's perspective, since both her own payoffs and the
system evolution depend on the actions of other agents, and each agent adapts
her policy simultaneously and independently. To address this issue, we
develop a two-timescale learning dynamics where each agent updates her local
Q-function and value function estimates concurrently, with the latter happening
at a slower timescale.
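
As a rough illustration of the two-timescale idea, the sketch below shows how a single agent might update its local Q-function quickly and its value estimate slowly, using only its own actions and payoffs. The variable names, the softmax form of the smoothed best response, and the step sizes are assumptions made for illustration; this is not the authors' exact algorithm.

```python
# Minimal sketch of a decentralized two-timescale update for one agent.
# All quantities (n_states, n_actions, alpha, beta, gamma, tau) are assumed
# for illustration and do not follow the paper's notation.
import numpy as np

n_states, n_actions = 4, 3           # assumed small problem size
gamma = 0.9                          # discount factor
alpha, beta = 0.1, 0.01              # fast (Q) and slow (value) step sizes
tau = 0.5                            # temperature of the smoothed best response

q = np.zeros((n_states, n_actions))  # local Q-function over the agent's own actions
v = np.zeros(n_states)               # local value-function estimate

def smoothed_best_response(q_row, tau):
    """Softmax response to the agent's local Q-values."""
    z = (q_row - q_row.max()) / tau
    p = np.exp(z)
    return p / p.sum()

def step(s, a, r, s_next):
    """One update from the agent's own observation (state, action, payoff, next state)."""
    # Fast timescale: temporal-difference update of the local Q-function,
    # bootstrapping from the slowly changing value estimate of the next state.
    q[s, a] += alpha * (r + gamma * v[s_next] - q[s, a])
    # Slow timescale: track the value of the smoothed best response to q(s, .).
    pi = smoothed_best_response(q[s], tau)
    v[s] += beta * (pi @ q[s] - v[s])
    return pi

# Example: one update from assumed transition data (s=0, a=1, r=0.3, s'=2).
pi = step(s=0, a=1, r=0.3, s_next=2)
```

Because the value estimate moves much more slowly than the Q-function, it appears quasi-static to the fast update, which is the intuition behind using two timescales to cope with the non-stationarity induced by the opponent's simultaneous adaptation.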