Mutation-Driven Follow the Regularized Leader for Last-Iterate
Convergence in Zero-Sum Games
- URL: http://arxiv.org/abs/2206.09254v1
- Date: Sat, 18 Jun 2022 17:32:07 GMT
- Title: Mutation-Driven Follow the Regularized Leader for Last-Iterate
Convergence in Zero-Sum Games
- Authors: Kenshi Abe, Mitsuki Sakamoto, Atsushi Iwasaki
- Abstract summary: Follow the Regularized Leader (FTRL) is guaranteed to converge to a Nash equilibrium when time-averaging the strategies.
We propose mutant FTRL (M-FTRL), an algorithm that introduces mutation for the perturbation of action probabilities.
- Score: 8.347058637480506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we consider a variant of the Follow the Regularized Leader
(FTRL) dynamics in two-player zero-sum games. FTRL is guaranteed to converge to
a Nash equilibrium when time-averaging the strategies, while a lot of variants
suffer from the issue of limit cycling behavior, i.e., lack the last-iterate
convergence guarantee. To this end, we propose mutant FTRL (M-FTRL), an
algorithm that introduces mutation for the perturbation of action
probabilities. We then investigate the continuous-time dynamics of M-FTRL and
provide the strong convergence guarantees toward stationary points that
approximate Nash equilibria under full-information feedback. Furthermore, our
simulation demonstrates that M-FTRL can enjoy faster convergence rates than
FTRL and optimistic FTRL under full-information feedback and surprisingly
exhibits clear convergence under bandit feedback.
Related papers
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance.<n>CalibRL increases policy entropy in a guided manner and clarifies the target distribution.<n>Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z) - SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees [33.46730273409721]
Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents.<n>Existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios.<n>We propose SeeUPO, a critic-free approach with convergence guarantees for multi-turn interactions.
arXiv Detail & Related papers (2026-02-06T09:57:23Z) - On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training [10.433802085981046]
Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL)<n>We show that RL increases SFT loss under SFT optimality and that SFT lowers the reward achieved by RL.<n> Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training.
arXiv Detail & Related papers (2026-01-12T10:14:09Z) - ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking [84.07076200941474]
ArenaRL is a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking.<n>We construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals.<n>Experiments show that ArenaRL substantially outperforms standard RL baselines.
arXiv Detail & Related papers (2026-01-10T08:43:07Z) - Trust-Region Adaptive Policy Optimization [82.09255251747818]
Post-training methods play an important role in improving large language models' (LLMs) complex reasoning abilities.<n>We introduce TRAPO, a framework that interleavesSupervised Fine-Tuning (SFT) and Reinforcement Learning (RL) within each training instance.<n>Experiments on five mathematical reasoning benchmarks show that TRAPO consistently surpasses standard SFT, RL, and SFT-then-RL pipelines.
arXiv Detail & Related papers (2025-12-19T14:37:07Z) - The Power of Perturbation under Sampling in Solving Extensive-Form Games [56.013335390600524]
This paper investigates how perturbation does and does not improve the Follow-the-Regularized-Leader (FTRL) algorithm in imperfect-information extensive-form games.
Perturbing the expected payoffs guarantees that the FTRL dynamics reach an approximate equilibrium.
We show that in the last-iterate sense, the FTRL consistently outperforms the non-samplinged FTRL.
arXiv Detail & Related papers (2025-01-28T00:29:38Z) - Continual SFT Matches Multimodal RLHF with Negative Supervision [32.784161582943874]
Multimodal RLHF usually happens after supervised finetuning (SFT) stage to continually improve vision-language models' (VLMs) comprehension.
Conventional wisdom holds its superiority over continual SFT during this preference alignment stage.
We propose a novel negative supervised finetuning (nSFT) approach that fully excavates these information resided.
arXiv Detail & Related papers (2024-11-22T08:48:30Z) - More Benefits of Being Distributional: Second-Order Bounds for
Reinforcement Learning [58.626683114119906]
We show that Distributional Reinforcement Learning (DistRL) can obtain second-order bounds in both online and offline RL.
Our results are the first second-order bounds for low-rank MDPs and for offline RL.
arXiv Detail & Related papers (2024-02-11T13:25:53Z) - Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations [98.5802673062712]
We introduce temporally-coupled perturbations, presenting a novel challenge for existing robust reinforcement learning methods.
We propose GRAD, a novel game-theoretic approach that treats the temporally-coupled robust RL problem as a partially observable two-player zero-sum game.
arXiv Detail & Related papers (2023-07-22T12:10:04Z) - Generalized Implicit Follow-The-Regularized-Leader [15.974402990630402]
Generalized implicit FTRL can recover known algorithms, as FTRL with linearized losses and implicit FTRL.
We show the flexibility of the framework by proving that some known algorithms, like the Mirror-Prox updates, are instantiations of the generalized implicit FTRL.
arXiv Detail & Related papers (2023-05-31T21:39:52Z) - One-Step Distributional Reinforcement Learning [10.64435582017292]
We present the simpler one-step distributional reinforcement learning (OS-DistrRL) framework.
We show that our approach comes with a unified theory for both policy evaluation and control.
We propose two OS-DistrRL algorithms for which we provide an almost sure convergence analysis.
arXiv Detail & Related papers (2023-04-27T06:57:00Z) - ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for
Last-Iterate Convergence in Constrained MDPs [31.663072540757643]
Reinforcement Learning (RL) has been applied to real-world problems with increasing success.
We introduce Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD)
arXiv Detail & Related papers (2023-02-02T18:05:27Z) - Faster Last-iterate Convergence of Policy Optimization in Zero-Sum
Markov Games [63.60117916422867]
This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
arXiv Detail & Related papers (2022-10-03T16:05:43Z) - Regularization Guarantees Generalization in Bayesian Reinforcement
Learning through Algorithmic Stability [48.62272919754204]
We study generalization in Bayesian RL under the probably approximately correct (PAC) framework.
Our main contribution is showing that by adding regularization, the optimal policy becomes stable in an appropriate sense.
arXiv Detail & Related papers (2021-09-24T07:48:34Z) - Combining Pessimism with Optimism for Robust and Efficient Model-Based
Deep Reinforcement Learning [56.17667147101263]
In real-world tasks, reinforcement learning agents encounter situations that are not present during training time.
To ensure reliable performance, the RL agents need to exhibit robustness against worst-case situations.
We propose the Robust Hallucinated Upper-Confidence RL (RH-UCRL) algorithm to provably solve this problem.
arXiv Detail & Related papers (2021-03-18T16:50:17Z) - Maximum Entropy RL (Provably) Solves Some Robust RL Problems [94.80212602202518]
We prove theoretically that standard maximum entropy RL is robust to some disturbances in the dynamics and the reward function.
Our results suggest that MaxEnt RL by itself is robust to certain disturbances, without requiring any additional modifications.
arXiv Detail & Related papers (2021-03-10T18:45:48Z) - Reinforcement Learning in Non-Stationary Discrete-Time Linear-Quadratic
Mean-Field Games [14.209473797379667]
We study large population multi-agent reinforcement learning (RL) in the context of discrete-time linear-quadratic mean-field games (LQ-MFGs)
Our setting differs from most existing work on RL for MFGs, in that we consider a non-stationary MFG over an infinite horizon.
We propose an actor-critic algorithm to iteratively compute the mean-field equilibrium (MFE) of the LQ-MFG.
arXiv Detail & Related papers (2020-09-09T15:17:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.