Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret
Learning in Markov Games
- URL: http://arxiv.org/abs/2206.01588v1
- Date: Fri, 3 Jun 2022 14:18:05 GMT
- Title: Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret
Learning in Markov Games
- Authors: Wenhao Zhan, Jason D. Lee, Zhuoran Yang
- Abstract summary: We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents.
We propose a new algorithm, \underline{D}ecentralized \underline{O}ptimistic hype\underline{R}policy m\underline{I}rror de\underline{S}cent (DORIS).
DORIS achieves $\sqrt{K}$-regret in the context of general function approximation, where $K$ is the number of episodes.
- Score: 95.10091348976779
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We study decentralized policy learning in Markov games where we control a
single agent to play with nonstationary and possibly adversarial opponents. Our
goal is to develop a no-regret online learning algorithm that (i) takes actions
based on the local information observed by the agent and (ii) is able to find
the best policy in hindsight. For such a problem, the nonstationary state
transitions due to the varying opponent pose a significant challenge. In light
of a recent hardness result \citep{liu2022learning}, we focus on the setting
where the opponent's previous policies are revealed to the agent for decision
making. With such an information structure, we propose a new algorithm,
\underline{D}ecentralized \underline{O}ptimistic hype\underline{R}policy
m\underline{I}rror de\underline{S}cent (DORIS), which achieves
$\sqrt{K}$-regret in the context of general function approximation, where $K$
is the number of episodes. Moreover, when all the agents adopt DORIS, we prove
that their mixture policy constitutes an approximate coarse correlated
equilibrium. In particular, DORIS maintains a \textit{hyperpolicy} which is a
distribution over the policy space. The hyperpolicy is updated via mirror
descent, where the update direction is obtained by an optimistic variant of
least-squares policy evaluation. Furthermore, to illustrate the power of our
method, we apply DORIS to constrained and vector-valued MDPs, which can be
formulated as zero-sum Markov games with a fictitious opponent.
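As a rough illustration of this update rule, here is a minimal sketch assuming a finite policy class, with random placeholder values standing in for the optimistic least-squares policy evaluation; the names, constants, and simulated values below are illustrative and not taken from the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's code): a hyperpolicy is a distribution
# over a finite policy class {pi_1, ..., pi_N}.  Each episode, every candidate
# policy receives an optimistic value estimate (a stand-in for DORIS's
# optimistic least-squares policy evaluation), and the hyperpolicy is updated
# by mirror descent with the KL divergence, i.e. an exponential-weights step.

rng = np.random.default_rng(0)
num_policies = 5          # size of the finite policy class (illustrative)
num_episodes = 100        # K, the number of episodes
eta = np.sqrt(np.log(num_policies) / num_episodes)  # typical mirror-descent step size

hyperpolicy = np.full(num_policies, 1.0 / num_policies)  # uniform initial distribution

for k in range(num_episodes):
    # Placeholder: optimistic value estimate of each candidate policy against
    # the opponents' revealed policies (simulated here with random numbers).
    optimistic_values = rng.uniform(size=num_policies)

    # Sample a policy from the hyperpolicy and execute it for this episode.
    executed_policy = rng.choice(num_policies, p=hyperpolicy)

    # Mirror-descent (exponential-weights) update of the hyperpolicy.
    weights = hyperpolicy * np.exp(eta * optimistic_values)
    hyperpolicy = weights / weights.sum()

print("final hyperpolicy:", np.round(hyperpolicy, 3))
```

With the KL divergence as the mirror map, the mirror-descent step reduces to the multiplicative-weights update shown above.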
Related papers
- When is Agnostic Reinforcement Learning Statistically Tractable? [76.1408672715773]
A new complexity measure, called the \emph{spanning capacity}, depends solely on the policy set $\Pi$ and is independent of the MDP dynamics.
We show there exists a policy class $\Pi$ with a bounded spanning capacity that requires a superpolynomial number of samples to learn.
This reveals a surprising separation for learnability between generative access and online access models.
arXiv Detail & Related papers (2023-10-09T19:40:54Z) - Learning Diverse Risk Preferences in Population-based Self-play [23.07952140353786]
Current self-play algorithms optimize the agent to maximize expected win-rates against its current or historical copies.
We introduce diversity from the perspective that agents could have diverse risk preferences in the face of uncertainty.
We show that our method achieves comparable or superior performance in competitive games.
arXiv Detail & Related papers (2023-05-19T06:56:02Z) - Decentralized model-free reinforcement learning in stochastic games with
average-reward objective [1.9852463786440127]
We show that our algorithm achieves both sublinear high-probability regret of order $T^{3/4}$ and sublinear expected regret of order $T^{2/3}$.
Our algorithm enjoys low computational complexity and a low memory requirement compared to previous works.
arXiv Detail & Related papers (2023-01-13T15:59:53Z) - Autoregressive Bandits [58.46584210388307]
We propose a novel online learning setting, Autoregressive Bandits, in which the observed reward is governed by an autoregressive process of order $k$.
We show that, under mild assumptions on the reward process, the optimal policy can be conveniently computed.
We then devise a new optimistic regret minimization algorithm, namely, AutoRegressive Upper Confidence Bound (AR-UCB), that suffers sublinear regret of order $\widetilde{\mathcal{O}}\left(\frac{(k+1)^{3/2}\sqrt{nT}}{(1-\Gamma)^2}\right)$; a minimal simulation of such an AR($k$) reward process is sketched after this list.
arXiv Detail & Related papers (2022-12-12T21:37:36Z) - Provably Efficient Fictitious Play Policy Optimization for Zero-Sum
Markov Games with Structured Transitions [145.54544979467872]
We propose and analyze new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions.
We prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario.
Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization.
arXiv Detail & Related papers (2022-07-25T18:29:16Z) - Byzantine-Robust Online and Offline Distributed Reinforcement Learning [60.970950468309056]
We consider a distributed reinforcement learning setting where multiple agents explore the environment and communicate their experiences through a central server.
An $\alpha$-fraction of the agents are adversarial and can report arbitrary fake information.
We seek to identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents.
arXiv Detail & Related papers (2022-06-01T00:44:53Z) - Decentralized Multi-Agent Reinforcement Learning: An Off-Policy Method [6.261762915564555]
We discuss the problem of decentralized multi-agent reinforcement learning (MARL) in this work.
In our setting, the global state, action, and reward are assumed to be fully observable, while each agent's local policy is kept private and thus cannot be shared with others.
Policy evaluation and policy improvement algorithms are designed for discrete and continuous state-action-space Markov Decision Processes (MDPs), respectively.
arXiv Detail & Related papers (2021-10-31T09:08:46Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z)
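As a rough illustration of the reward model in the Autoregressive Bandits entry above, here is a minimal sketch, assuming a per-arm bias and per-arm AR coefficients whose sums stay below one; the arm choices, constants, and names are illustrative placeholders, not AR-UCB itself.

```python
import numpy as np

# Illustrative sketch (assumptions, not the paper's code): in the
# Autoregressive Bandits setting the observed reward follows an
# autoregressive process of order k whose parameters depend on the arm
# pulled.  Below, each arm has its own bias and AR coefficients; keeping
# each coefficient sum below 1 keeps the process stable.

rng = np.random.default_rng(1)
k = 2                      # order of the autoregressive process
n_arms = 3
horizon = 200

bias = rng.uniform(0.5, 1.5, size=n_arms)             # per-arm constant term (illustrative)
ar_coeffs = rng.uniform(0.0, 0.4, size=(n_arms, k))   # per-arm AR coefficients, sum < 1

history = [0.0] * k        # last k observed rewards, most recent first
rewards = []

for t in range(horizon):
    arm = rng.integers(n_arms)        # stand-in for the learner's arm choice
    noise = rng.normal(scale=0.1)
    reward = bias[arm] + ar_coeffs[arm] @ np.array(history) + noise
    rewards.append(reward)
    history = [reward] + history[:k - 1]   # shift the AR window

print("average observed reward:", np.mean(rewards))
```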