Finite-Sample Convergence Bounds for Trust Region Policy Optimization in Mean-Field Games
- URL: http://arxiv.org/abs/2505.22781v1
- Date: Wed, 28 May 2025 18:50:25 GMT
- Title: Finite-Sample Convergence Bounds for Trust Region Policy Optimization in Mean-Field Games
- Authors: Antonio Ocello, Daniil Tiapkin, Lorenzo Mancini, Mathieu Laurière, Eric Moulines
- Abstract summary: We introduce a novel algorithm designed to compute approximate Nash equilibria for ergodic Mean-Field Games (MFG) in finite state-action spaces. Under standard assumptions in the MFG literature, we provide a rigorous analysis of MF-TRPO, establishing theoretical guarantees on its convergence. This work advances MFG optimization by bridging RL techniques with mean-field decision-making, offering a theoretically grounded approach to solving complex multi-agent problems.
- Score: 14.104031043622351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Mean-Field Trust Region Policy Optimization (MF-TRPO), a novel algorithm designed to compute approximate Nash equilibria for ergodic Mean-Field Games (MFG) in finite state-action spaces. Building on the well-established performance of TRPO in the reinforcement learning (RL) setting, we extend its methodology to the MFG framework, leveraging its stability and robustness in policy optimization. Under standard assumptions in the MFG literature, we provide a rigorous analysis of MF-TRPO, establishing theoretical guarantees on its convergence. Our results cover both the exact formulation of the algorithm and its sample-based counterpart, where we derive high-probability guarantees and finite sample complexity. This work advances MFG optimization by bridging RL techniques with mean-field decision-making, offering a theoretically grounded approach to solving complex multi-agent problems.
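The abstract describes MF-TRPO only at a high level. As a rough illustration of the kind of fixed-point scheme it implies, the sketch below alternates a KL-regularized policy improvement step with a damped update of the population distribution on a toy finite MFG. The crowd-aversion reward, the exponentiated-update proxy for the trust-region step, and all constants are assumptions made for illustration, not the authors' algorithm.

```python
# Toy tabular sketch (not the paper's MF-TRPO): alternate a KL-regularized
# policy improvement step with a damped mean-field update on a random
# finite-state MFG with a crowd-aversion reward.
import numpy as np

n_states, n_actions, gamma = 5, 3, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']

def reward(mu):
    # Hypothetical mean-field coupling: crowded states are penalized.
    return -np.outer(mu, np.ones(n_actions))                       # r[s, a]

def stationary_distribution(pi):
    # Stationary state distribution of the chain induced by policy pi.
    P_pi = np.einsum("sap,sa->sp", P, pi)
    evals, evecs = np.linalg.eig(P_pi.T)
    mu = np.real(evecs[:, np.argmax(np.real(evals))])
    return np.abs(mu) / np.abs(mu).sum()

def q_values(pi, mu):
    # Discounted Q-values under a frozen mean field mu (proxy for the ergodic criterion).
    r = reward(mu)
    P_pi = np.einsum("sap,sa->sp", P, pi)
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, (pi * r).sum(axis=1))
    return r + gamma * P @ v

def kl_regularized_step(pi, q, step=0.5):
    # Exponentiated improvement: solves a KL-penalized surrogate per state,
    # used here as a stand-in for the trust-region-constrained update.
    new = pi * np.exp(step * q)
    return new / new.sum(axis=1, keepdims=True)

pi = np.full((n_states, n_actions), 1.0 / n_actions)
mu = np.full(n_states, 1.0 / n_states)
for _ in range(200):                                     # fixed-point iteration
    pi = kl_regularized_step(pi, q_values(pi, mu))
    mu = 0.9 * mu + 0.1 * stationary_distribution(pi)    # damped population update
print("approximate equilibrium policy:\n", pi.round(3))
```

In the sample-based counterpart analyzed in the paper, the Q-values and the stationary distribution would be estimated from simulated trajectories rather than computed from the transition model.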
Related papers
- PPO in the Fisher-Rao geometry [0.0]
Proximal Policy Optimization (PPO) has become a widely adopted algorithm for reinforcement learning. Despite its popularity, PPO lacks formal theoretical guarantees for policy improvement and convergence. In this paper, we derive a tighter surrogate in the Fisher-Rao geometry, yielding a novel variant, Fisher-Rao PPO (FR-PPO).
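The summary mentions a tighter surrogate in the Fisher-Rao geometry but gives no formula. For orientation only, below is the standard clipped PPO surrogate next to a generic trust-region surrogate in which the KL term is replaced by the Fisher-Rao distance on the probability simplex; the exact FR-PPO surrogate derived in the paper may differ.

```latex
% Illustration only: clipped PPO surrogate vs. a generic Fisher--Rao-penalized
% surrogate; not necessarily the exact objective of the cited paper.
\[
L^{\mathrm{PPO}}(\theta) = \mathbb{E}_{s,a}\!\left[\min\!\big(r_\theta(s,a)A(s,a),\;
  \operatorname{clip}(r_\theta(s,a),1-\epsilon,1+\epsilon)A(s,a)\big)\right],
\qquad r_\theta(s,a)=\frac{\pi_\theta(a\mid s)}{\pi_{\theta_{\mathrm{old}}}(a\mid s)},
\]
\[
L^{\mathrm{FR}}(\theta) = \mathbb{E}_{s,a}\!\left[r_\theta(s,a)A(s,a)\right]
  - \beta\,\mathbb{E}_{s}\!\left[d_{\mathrm{FR}}\!\big(\pi_\theta(\cdot\mid s),\pi_{\theta_{\mathrm{old}}}(\cdot\mid s)\big)^{2}\right],
\qquad
d_{\mathrm{FR}}(p,q) = 2\arccos\!\Big(\sum_a \sqrt{p(a)\,q(a)}\Big).
\]
```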
arXiv Detail & Related papers (2025-06-04T09:23:27Z) - RL-finetuning LLMs from on- and off-policy data with a single algorithm [53.70731390624718]
We introduce a novel reinforcement learning algorithm (AGRO) for fine-tuning large language models. AGRO leverages the concept of generation consistency, which states that the optimal policy satisfies the notion of consistency across any possible generation of the model. We derive algorithms that find optimal solutions via the sample-based policy gradient and provide theoretical guarantees on their convergence.
arXiv Detail & Related papers (2025-03-25T12:52:38Z) - Efficient and Scalable Deep Reinforcement Learning for Mean Field Control Games [16.62770187749295]
Mean Field Control Games (MFCGs) provide a powerful theoretical framework for analyzing systems of infinitely many interacting agents. This paper presents a scalable deep Reinforcement Learning (RL) approach to approximate equilibrium solutions of MFCGs.
arXiv Detail & Related papers (2024-12-28T02:04:53Z) - Maximum Causal Entropy IRL in Mean-Field Games and GNEP Framework for Forward RL [2.867517731896504]
This paper explores the use of Maximum Causal Entropy Inverse Reinforcement Learning (IRL) within the context of discrete-time Mean-Field Games (MFGs). The forward RL problem is formulated as a Generalized Nash Equilibrium Problem (GNEP).
arXiv Detail & Related papers (2024-01-12T13:22:03Z) - On the Statistical Efficiency of Mean-Field Reinforcement Learning with General Function Approximation [20.66437196305357]
We study the fundamental statistical efficiency of Reinforcement Learning in Mean-Field Control (MFC) and Mean-Field Game (MFG) with general model-based function approximation.
We introduce a new concept called Mean-Field Model-Based Eluder Dimension (MF-MBED), which characterizes the inherent complexity of mean-field model classes.
arXiv Detail & Related papers (2023-05-18T20:00:04Z) - Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
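For context, the clipped surrogate that "vanilla PPO" applies to each agent's local policy looks like the following sketch; tensor names and the surrounding training loop are assumptions, not the cited paper's implementation.

```python
# Illustrative sketch: the clipped-objective loss each agent applies to its
# own (local) policy, as in single-agent PPO.
import torch

def local_ppo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss for one agent, computed on its own actions."""
    ratio = torch.exp(logp_new - logp_old)               # pi_new / pi_old per sample
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Each agent i would minimize local_ppo_loss on its own trajectories while the
# other agents' policies are held fixed during its update.
```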
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - Trust-Region-Free Policy Optimization for Stochastic Policies [60.52463923712565]
We show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is explicitly free of any trust region constraints.
arXiv Detail & Related papers (2023-02-15T23:10:06Z) - Bounded Robustness in Reinforcement Learning via Lexicographic Objectives [54.00072722686121]
Policy robustness in Reinforcement Learning may not be desirable at any cost.
We study how policies can be maximally robust to arbitrary observational noise.
We propose a robustness-inducing scheme, applicable to any policy algorithm, that trades off expected policy utility for robustness.
arXiv Detail & Related papers (2022-09-30T08:53:18Z) - Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO [66.5384483339413]
We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL).
We show that a trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training.
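As a back-of-the-envelope illustration of why per-agent ratio bounds are scaled by the number of agents n (not the cited paper's exact statement): with decentralized policies the joint ratio factorizes across agents, so constraining each factor to within 1 ± ε/n keeps the joint ratio within roughly 1 ± ε.

```latex
% Illustration of the scaling with n; assumes each per-agent ratio lies in
% [1 - eps/n, 1 + eps/n].  Not the cited paper's exact bound.
\[
\frac{\pi_{\mathrm{new}}(\mathbf{a}\mid s)}{\pi_{\mathrm{old}}(\mathbf{a}\mid s)}
  \;=\; \prod_{i=1}^{n} \frac{\pi^{i}_{\mathrm{new}}(a^{i}\mid s)}{\pi^{i}_{\mathrm{old}}(a^{i}\mid s)},
\qquad
1-\epsilon \;\le\; \Big(1-\tfrac{\epsilon}{n}\Big)^{n}
  \;\le\; \prod_{i=1}^{n}\frac{\pi^{i}_{\mathrm{new}}(a^{i}\mid s)}{\pi^{i}_{\mathrm{old}}(a^{i}\mid s)}
  \;\le\; \Big(1+\tfrac{\epsilon}{n}\Big)^{n} \;\le\; e^{\epsilon}.
\]
```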
arXiv Detail & Related papers (2022-01-31T20:39:48Z) - Permutation Invariant Policy Optimization for Mean-Field Multi-Agent Reinforcement Learning: A Principled Approach [128.62787284435007]
We propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture.
We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence.
In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors.
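One standard way to obtain permutation invariance over agents is mean pooling of per-agent embeddings (DeepSets-style). The sketch below illustrates this idea for a critic; the layer sizes and the pooling choice are assumptions, not the cited paper's exact architecture.

```python
# Illustrative permutation-invariant critic: embed each agent's observation,
# then mean-pool over the agent axis so the output ignores agent ordering.
import torch
import torch.nn as nn

class PermutationInvariantCritic(nn.Module):
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, obs):                      # obs: (batch, n_agents, obs_dim)
        pooled = self.embed(obs).mean(dim=1)     # mean over agents => order-invariant
        return self.head(pooled).squeeze(-1)     # one value per batch element

# Permuting the agent axis leaves the output (numerically) unchanged:
critic = PermutationInvariantCritic(obs_dim=4)
x = torch.randn(2, 5, 4)
assert torch.allclose(critic(x), critic(x[:, torch.randperm(5), :]), atol=1e-6)
```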
arXiv Detail & Related papers (2021-05-18T04:35:41Z) - CRPO: A New Approach for Safe Reinforcement Learning with Convergence Guarantee [61.176159046544946]
In safe reinforcement learning (SRL) problems, an agent explores the environment to maximize an expected total reward and avoids violation of certain constraints.
This provides the first analysis of SRL algorithms with guarantees of convergence to globally optimal policies.
arXiv Detail & Related papers (2020-11-11T16:05:14Z) - Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization [44.24881971917951]
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms.
We develop convergence guarantees for entropy-regularized NPG methods under softmax parameterization.
Our results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
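For reference, a common way to write the entropy-regularized NPG update under softmax parameterization is the multiplicative form below (step size η, regularization weight τ, discount factor γ); consult the paper for the precise statement and constants.

```latex
% A common form of the entropy-regularized NPG update under softmax
% parameterization; see the cited paper for the exact statement.
\[
\pi^{(t+1)}(a\mid s) \;\propto\;
  \big(\pi^{(t)}(a\mid s)\big)^{1-\frac{\eta\tau}{1-\gamma}}
  \exp\!\left(\frac{\eta\, Q_\tau^{(t)}(s,a)}{1-\gamma}\right),
\]
% where $Q_\tau^{(t)}$ denotes the soft (entropy-regularized) Q-function of $\pi^{(t)}$.
```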
arXiv Detail & Related papers (2020-07-13T17:58:41Z)