Oracle-free Reinforcement Learning in Mean-Field Games along a Single Sample Path
- URL: http://arxiv.org/abs/2208.11639v3
- Date: Tue, 11 Apr 2023 17:22:47 GMT
- Title: Oracle-free Reinforcement Learning in Mean-Field Games along a Single Sample Path
- Authors: Muhammad Aneeq uz Zaman, Alec Koppel, Sujay Bhatt, Tamer Başar
- Abstract summary: We consider online reinforcement learning in Mean-Field Games (MFGs).
We develop an algorithm that approximates the Mean-Field Equilibrium (MFE) using the single sample path of the generic agent.
We empirically demonstrate the effectiveness of the sandbox learning algorithm in diverse scenarios.
- Score: 5.926203312586109
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider online reinforcement learning in Mean-Field Games (MFGs). Unlike
traditional approaches, we alleviate the need for a mean-field oracle by
developing an algorithm that approximates the Mean-Field Equilibrium (MFE)
using the single sample path of the generic agent. We call this Sandbox
Learning, as it can be used as a warm-start for any agent learning in a
multi-agent non-cooperative setting. We adopt a two time-scale approach in
which an online fixed-point recursion for the mean-field operates on a slower
time-scale, in tandem with a control policy update on a faster time-scale for
the generic agent. Given that the underlying Markov Decision Process (MDP) of
the agent is communicating, we provide finite sample convergence guarantees in
terms of convergence of the mean-field and control policy to the mean-field
equilibrium. The sample complexity of the Sandbox learning algorithm is
$\tilde{\mathcal{O}}(\epsilon^{-4})$ where $\epsilon$ is the MFE approximation
error. This is similar to works which assume access to a mean-field oracle. Finally, we
empirically demonstrate the effectiveness of the sandbox learning algorithm in
diverse scenarios, including those where the MDP does not necessarily have a
single communicating class.
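To make the two time-scale structure concrete, below is a minimal, hypothetical sketch of a sandbox-style learning loop in a tabular setting. The environment interface `env.step(action, mu)` (returning the next state and a reward that depend on the current mean-field estimate `mu`), the step-size exponents, and the use of discounted Q-learning as the control-policy update are illustrative assumptions, not the paper's exact construction.

```python
# Hypothetical sketch of a two-timescale "sandbox" learning loop for a
# mean-field game, run along a single sample path of a generic agent.
# The env interface, step-size schedules, and discounted Q-learning
# stand-in for the control-policy update are assumptions for illustration.

import numpy as np

def sandbox_learning(env, n_states, n_actions, horizon=100_000,
                     gamma=0.95, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))       # action-value estimates
    mu = np.full(n_states, 1.0 / n_states)    # mean-field (state distribution) estimate

    s = env.reset()
    for t in range(1, horizon + 1):
        # Faster time-scale step size for the policy (Q) update,
        # slower time-scale step size for the mean-field recursion.
        alpha = 1.0 / t**0.6   # policy update (fast)
        beta = 1.0 / t**0.9    # mean-field update (slow), beta = o(alpha)

        # Epsilon-greedy exploration so all state-action pairs keep being
        # visited (made possible by the communicating-MDP assumption).
        if rng.random() < eps:
            a = rng.integers(n_actions)
        else:
            a = int(np.argmax(Q[s]))

        # One transition of the single sample path; reward and dynamics
        # are evaluated at the *current* mean-field estimate mu.
        s_next, r = env.step(a, mu)

        # Fast time scale: Q-learning update for the generic agent.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

        # Slow time scale: online fixed-point recursion pushing mu toward
        # the state occupancy induced by the agent's current behavior.
        indicator = np.zeros(n_states)
        indicator[s_next] = 1.0
        mu += beta * (indicator - mu)

        s = s_next

    policy = np.argmax(Q, axis=1)
    return mu, policy
```

The design choice mirrored from the abstract is that the mean-field step size is asymptotically negligible relative to the policy step size, so the mean-field estimate looks quasi-static from the policy update's viewpoint, while the policy looks effectively converged from the mean-field recursion's viewpoint.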
Related papers
- Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees [91.88803125231189]
Multi-step Preference Optimization (MPO) is built upon the natural actor-critic framework \citep{rakhlin2013online, joulani17a}.
We show that OMPO requires $\mathcal{O}(\epsilon^{-1})$ policy updates to converge to an $\epsilon$-approximate Nash equilibrium.
We also validate the effectiveness of our method on multi-turn conversations dataset and math reasoning dataset.
arXiv Detail & Related papers (2025-02-18T09:33:48Z)
- Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning [4.899818550820576]
We propose a new algorithm for multi-agent reinforcement learning.
We show that this learned policy converges to the optimal policy on the order of $\tilde{\mathcal{O}}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases.
arXiv Detail & Related papers (2024-12-01T03:45:17Z)
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z)
- Theoretical Study of Conflict-Avoidant Multi-Objective Reinforcement Learning [21.288881065839007]
We develop a novel dynamic weighting multi-task actor-critic algorithm (MTAC) with two sub-procedure options, named CA and FC.
MTAC-CA aims to find a conflict-avoidant (CA) update direction that maximizes the minimum value improvement among tasks, while MTAC-FC targets a much faster convergence rate.
Our experiments on MT10 demonstrate the improved performance of our algorithms over existing MTRL methods with fixed preference.
arXiv Detail & Related papers (2024-05-25T05:57:46Z)
- A Single Online Agent Can Efficiently Learn Mean Field Games [16.00164239349632]
Mean field games (MFGs) are a promising framework for modeling the behavior of large-population systems.
This paper introduces a novel online single-agent model-free learning scheme, which enables a single agent to learn the mean-field Nash equilibrium (MFNE) using online samples.
arXiv Detail & Related papers (2024-05-05T16:38:04Z)
- DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speed up a common Stochastic Approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose DASA, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z)
- Regularization of the policy updates for stabilizing Mean Field Games [0.2348805691644085]
This work studies non-cooperative Multi-Agent Reinforcement Learning (MARL), where multiple agents interact in the same environment and each aims to maximize its individual return.
We name our algorithm Mean Field Proximal Policy Optimization (MF-PPO), and we empirically show the effectiveness of our method in the OpenSpiel framework.
arXiv Detail & Related papers (2023-04-04T05:45:42Z)
- Beyond ADMM: A Unified Client-variance-reduced Adaptive Federated Learning Framework [82.36466358313025]
We propose a primal-dual FL algorithm, termed FedVRA, that allows one to adaptively control the variance-reduction level and bias of the global model.
Experiments based on (semi-supervised) image classification tasks demonstrate superiority of FedVRA over the existing schemes.
arXiv Detail & Related papers (2022-12-03T03:27:51Z)
- Online Sub-Sampling for Reinforcement Learning with General Function Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with a complexity-bounded function class, we show that the policy only needs to be updated $\propto \operatorname{polylog}(K)$ times.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z)