Related papers: Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning

Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning

URL: http://arxiv.org/abs/2404.10728v1
Date: Tue, 16 Apr 2024 17:01:38 GMT
Title: Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning
Authors: Hao-Lun Hsu, Weixin Wang, Miroslav Pajic, Pan Xu,
Abstract summary: We present the first study on provably efficient randomized exploration in cooperative multi-agent reinforcement learning (MARL) We propose a unified algorithm framework for randomized exploration in parallel Markov Decision Processes (MDPs), and two Thompson Sampling (TS)-type algorithms, CoopTS-PHE and CoopTS-LMC. We evaluate our proposed method on multiple parallel RL environments, including a deep exploration problem (textiti.e., $N$-chain), a video game, and a real-world problem in energy systems.
Score: 15.46907000938726
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present the first study on provably efficient randomized exploration in cooperative multi-agent reinforcement learning (MARL). We propose a unified algorithm framework for randomized exploration in parallel Markov Decision Processes (MDPs), and two Thompson Sampling (TS)-type algorithms, CoopTS-PHE and CoopTS-LMC, incorporating the perturbed-history exploration (PHE) strategy and the Langevin Monte Carlo exploration (LMC) strategy respectively, which are flexible in design and easy to implement in practice. For a special class of parallel MDPs where the transition is (approximately) linear, we theoretically prove that both CoopTS-PHE and CoopTS-LMC achieve a $\widetilde{\mathcal{O}}(d^{3/2}H^2\sqrt{MK})$ regret bound with communication complexity $\widetilde{\mathcal{O}}(dHM^2)$, where $d$ is the feature dimension, $H$ is the horizon length, $M$ is the number of agents, and $K$ is the number of episodes. This is the first theoretical result for randomized exploration in cooperative MARL. We evaluate our proposed method on multiple parallel RL environments, including a deep exploration problem (\textit{i.e.,} $N$-chain), a video game, and a real-world problem in energy systems. Our experimental results support that our framework can achieve better performance, even under conditions of misspecified transition models. Additionally, we establish a connection between our unified framework and the practical application of federated learning.

Related papers

Provable Cooperative Multi-Agent Exploration for Reward-Free MDPs [40.06714252547274]
We study cooperative multi-agent reinforcement learning in the setting of reward-free exploration.<n>We focus on a finite-horizon MDP and adopt a phased learning framework.
arXiv Detail & Related papers (2026-02-01T21:44:11Z)
Scaling Distributed Multi-task Reinforcement Learning with Experience Sharing [38.883540444516605]
DARPA launched the ShELL program, which aims to explore how experience sharing can benefit distributed lifelong learning agents. We conduct both theoretical and empirical research on distributed multi-task reinforcement learning (RL), where a group of $N$ agents collaboratively solves $M$ tasks. We propose an algorithm called DistMT-LSVI, where each agent independently learns $epsilon$-optimal policies for all $M$ tasks.
arXiv Detail & Related papers (2023-07-11T22:58:53Z)
Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL) We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear Markov decision processes (MDPs) At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward. We show that our algorithm only needs to explore $tilde O( d2varepsilon-2)$ episodes to find an $varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z)
Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration. We show that this separation does not exist in the setting of linear MDPs. We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z)
Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning [82.31436758872715]
We develop an algorithm that achieves the same PAC guarantee while using only $O(1)$ episodes of environment interactions. We establish a connection between value functions in discounted and finite-horizon Markov decision processes.
arXiv Detail & Related papers (2021-11-01T00:21:24Z)
On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. We tackle this problem under the context of function approximation, leveraging powerful function approximators. We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z)
Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $tilde O(|S||B|(gamma)-3epsilon-2)$ for finding the Nash equilibrium (NE) value up to some $epsilon$ error. We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
A Provably Efficient Sample Collection Strategy for Reinforcement Learning [123.69175280309226]
One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior. We propose to tackle the exploration-exploitation problem following a decoupled approach composed of: 1) An "objective-specific" algorithm that prescribes how many samples to collect at which states, as if it has access to a generative model (i.e., sparse simulator of the environment); 2) An "objective-agnostic" sample collection responsible for generating the prescribed samples as fast as possible.
arXiv Detail & Related papers (2020-07-13T15:17:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.