V-Learning -- A Simple, Efficient, Decentralized Algorithm for
Multiagent RL
- URL: http://arxiv.org/abs/2110.14555v1
- Date: Wed, 27 Oct 2021 16:25:55 GMT
- Title: V-Learning -- A Simple, Efficient, Decentralized Algorithm for
Multiagent RL
- Authors: Chi Jin, Qinghua Liu, Yuanhao Wang, Tiancheng Yu
- Abstract summary: V-learning is a new class of single-agent RL algorithms that convert any adversarial bandit algorithm with suitable regret guarantees into an RL algorithm.
Unlike Q-learning, it maintains only estimates of V-values rather than Q-values.
- Score: 35.304241088947116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A major challenge of multiagent reinforcement learning (MARL) is the curse of
multiagents, where the size of the joint action space scales exponentially with
the number of agents. This remains a bottleneck for designing efficient
MARL algorithms even in a basic scenario with finitely many states and actions.
This paper resolves this challenge for the model of episodic Markov games. We
design a new class of fully decentralized algorithms -- V-learning, which
provably learns Nash equilibria (in the two-player zero-sum setting),
correlated equilibria and coarse correlated equilibria (in the multiplayer
general-sum setting) in a number of samples that only scales with
$\max_{i\in[m]} A_i$, where $A_i$ is the number of actions for the $i^{\rm th}$
player. This is in sharp contrast to the size of the joint action space which
is $\prod_{i=1}^m A_i$. V-learning (in its basic form) is a new class of
single-agent RL algorithms that convert any adversarial bandit algorithm with
suitable regret guarantees into an RL algorithm. Similar to the classical
Q-learning algorithm, it performs incremental updates to the value functions.
Unlike Q-learning, it maintains only estimates of V-values rather than
Q-values. This key difference allows V-learning to achieve the claimed
guarantees in the MARL setting by simply letting all agents run V-learning
independently.
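As a concrete illustration of the update described above, the following is a minimal sketch of a single tabular V-learning-style agent in the episodic setting. It is illustrative only: EXP3-style exponential weights stand in for the adversarial bandit subroutine (the paper requires bandit algorithms with stronger weighted, high-probability regret guarantees), the step-size and bonus constants are simplified placeholders, and the construction of the output (certified) policy from the history of bandit distributions is omitted.

```python
import numpy as np

class VLearningAgent:
    """Minimal sketch of one tabular V-learning-style agent (illustrative,
    not the paper's exact algorithm): it maintains V-value estimates per
    (step, state) and an EXP3-style bandit over this agent's own actions."""

    def __init__(self, num_states, num_actions, horizon, c_bonus=1.0):
        self.S, self.A, self.H = num_states, num_actions, horizon
        self.V = np.zeros((horizon + 1, num_states))        # V[H] = 0 after the last step
        self.weights = np.ones((horizon, num_states, num_actions))
        self.visits = np.zeros((horizon, num_states), dtype=int)
        self.c_bonus = c_bonus                              # bonus scale (placeholder constant)

    def policy(self, h, s):
        w = self.weights[h, s]
        return w / w.sum()                                  # bandit's action distribution at (h, s)

    def act(self, h, s, rng):
        return int(rng.choice(self.A, p=self.policy(h, s)))

    def update(self, h, s, a, reward, s_next):
        """One incremental V-value update plus bandit feedback."""
        self.visits[h, s] += 1
        t = self.visits[h, s]
        alpha = (self.H + 1) / (self.H + t)                 # step size alpha_t = (H+1)/(H+t)
        bonus = self.c_bonus * np.sqrt(self.H ** 3 / t)     # optimism bonus (simplified form)
        target = reward + self.V[h + 1, s_next] + bonus
        self.V[h, s] = min(self.H, (1 - alpha) * self.V[h, s] + alpha * target)
        # Feed a loss in [0, 1] for the chosen action to the bandit,
        # importance-weighted as in EXP3 so unchosen actions need no feedback.
        p_a = self.policy(h, s)[a]
        loss = float(np.clip((self.H - (reward + self.V[h + 1, s_next])) / self.H, 0.0, 1.0))
        eta = np.sqrt(np.log(self.A) / (self.A * t))        # decaying bandit learning rate
        self.weights[h, s, a] *= np.exp(-eta * loss / max(p_a, 1e-8))
```

Note that only V-values indexed by (step, state) and a bandit over this agent's own $A_i$ actions are stored, so no data structure grows with the joint action space.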
Related papers
- Breaking the Curse of Multiagents in a Large State Space: RL in Markov
Games with Independent Linear Function Approximation [56.715186432566576]
We propose a new model, independent linear Markov game, for reinforcement learning with a large state space and a large number of agents.
We design new algorithms for learning Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that scale only with each agent's own function class complexity.
Our algorithms rely on two key technical innovations: (1) policy replay to handle the non-stationarity induced by multiple agents and by function approximation; and (2) separating the learning of Markov equilibria from exploration in Markov games.
arXiv Detail & Related papers (2023-02-07T18:47:48Z) - Representation Learning for General-sum Low-rank Markov Games [63.119870889883224]
We study multi-agent general-sum Markov games with nonlinear function approximation.
We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation.
arXiv Detail & Related papers (2022-10-30T22:58:22Z) - Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a
Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy with a minimax-optimal sample complexity.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z) - Decentralized Cooperative Multi-Agent Reinforcement Learning with
Exploration [35.75029940279768]
We study multi-agent reinforcement learning in the most basic cooperative setting -- Markov teams.
We propose an algorithm in which each agent independently runs a stage-based V-learning style algorithm (a decentralized loop of this kind is sketched after this list).
We show that the agents can learn an $\epsilon$-approximate Nash equilibrium policy in at most $\widetilde{O}(1/\epsilon^4)$ episodes.
arXiv Detail & Related papers (2021-10-12T02:45:12Z) - Provably Efficient Reinforcement Learning in Decentralized General-Sum
Markov Games [5.205867750232226]
This paper addresses the problem of learning an equilibrium efficiently in general-sum Markov games.
We propose an algorithm in which each agent independently runs optimistic V-learning to efficiently explore the unknown environment.
We show that the agents can find an $\epsilon$-approximate CCE in at most $\widetilde{O}(H^6 S A/\epsilon^2)$ episodes.
arXiv Detail & Related papers (2021-10-12T02:01:22Z) - Towards General Function Approximation in Zero-Sum Markov Games [126.58493169301012]
This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves.
Provably efficient algorithms for both decoupled and coordinated settings are developed.
arXiv Detail & Related papers (2021-07-30T15:25:13Z) - QVMix and QVMix-Max: Extending the Deep Quality-Value Family of
Algorithms to Cooperative Multi-Agent Reinforcement Learning [10.334745043233974]
This paper introduces four new algorithms for tackling multi-agent reinforcement learning problems.
All algorithms are based on the Deep Quality-Value family of algorithms.
We show competitive results when QVMix and QVMix-Max are compared to well-known MARL techniques.
arXiv Detail & Related papers (2020-12-22T14:53:42Z) - A Sharp Analysis of Model-based Reinforcement Learning with Self-Play [49.88233710867315]
We present a sharp analysis of model-based self-play algorithms for multi-agent Markov games.
We design an algorithm -- Optimistic Nash Value Iteration (Nash-VI) for two-player zero-sum Markov games.
We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games.
arXiv Detail & Related papers (2020-10-04T15:27:39Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\widetilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
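Several of the entries above (the decentralized cooperative Markov-team work and the decentralized general-sum work in particular) share the execution pattern of V-learning itself: every agent runs its own local learner over its own action set and never observes the other agents' actions. Below is a minimal sketch of that decentralized loop, reusing the illustrative VLearningAgent sketched earlier and assuming a hypothetical multi-agent environment interface env.reset() / env.step(joint_action) that returns the shared state and a per-agent list of rewards.

```python
import numpy as np

def run_decentralized(env, agents, num_episodes, horizon, seed=0):
    """Decentralized training loop: each agent updates only from
    (state, its own action, its own reward, next state).

    `env` is a hypothetical episodic multi-agent environment with
    reset() -> state and step(joint_action) -> (next_state, rewards),
    where `rewards` has one entry per agent."""
    rng = np.random.default_rng(seed)
    for _ in range(num_episodes):
        s = env.reset()
        for h in range(horizon):
            joint_action = [agent.act(h, s, rng) for agent in agents]
            s_next, rewards = env.step(joint_action)
            for i, agent in enumerate(agents):
                # No agent sees the others' actions, so nothing here
                # scales with the joint action space prod_i A_i.
                agent.update(h, s, joint_action[i], rewards[i], s_next)
            s = s_next
    return agents
```

This is exactly the sense in which the sample and memory requirements scale with $\max_{i\in[m]} A_i$ rather than $\prod_{i=1}^m A_i$: each agent's data structures and updates involve only its own action set.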