Achieving Logarithmic Regret in KL-Regularized Zero-Sum Markov Games
- URL: http://arxiv.org/abs/2510.13060v1
- Date: Wed, 15 Oct 2025 01:00:54 GMT
- Title: Achieving Logarithmic Regret in KL-Regularized Zero-Sum Markov Games
- Authors: Anupam Nayak, Tong Yang, Osman Yagan, Gauri Joshi, Yuejie Chi,
- Abstract summary: We develop and analyze algorithms that provably achieve improved sample efficiency under reverse Kullback-Leibler (KL) regularization. We study both two-player zero-sum matrix games and Markov games: for matrix games, we propose OMG, an algorithm based on best response sampling with optimistic bonuses, and extend this idea to Markov games through the algorithm SOMG. Both algorithms achieve a logarithmic regret in $T$ that scales inversely with the KL regularization strength $\beta$, in addition to the standard $\widetilde{\mathcal{O}}(\sqrt{T})$ regret independent of $\beta$.
- Score: 53.447182734351
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reverse Kullback-Leibler (KL) divergence-based regularization with respect to a fixed reference policy is widely used in modern reinforcement learning to preserve the desired traits of the reference policy and sometimes to promote exploration (using a uniform reference policy, known as entropy regularization). Beyond serving as a mere anchor, the reference policy can also be interpreted as encoding prior knowledge about good actions in the environment. In the context of alignment, recent game-theoretic approaches have leveraged KL regularization with pretrained language models as reference policies, achieving notable empirical success in self-play methods. Despite these advances, the theoretical benefits of KL regularization in game-theoretic settings remain poorly understood. In this work, we develop and analyze algorithms that provably achieve improved sample efficiency under KL regularization. We study both two-player zero-sum matrix games and Markov games: for matrix games, we propose OMG, an algorithm based on best response sampling with optimistic bonuses, and extend this idea to Markov games through the algorithm SOMG, which also uses best response sampling and a novel concept of superoptimistic bonuses. Both algorithms achieve a logarithmic regret in $T$ that scales inversely with the KL regularization strength $\beta$, in addition to the standard $\widetilde{\mathcal{O}}(\sqrt{T})$ regret, independent of $\beta$, which is attained in both regularized and unregularized settings.
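For intuition, the sketch below shows the closed-form reverse-KL-regularized best response against a fixed opponent strategy in a zero-sum matrix game: the maximizer of $\pi^\top A \nu - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ is a softmax of the action values tilted by the reference policy. This is only a minimal illustration of the regularized objective, not the paper's OMG or SOMG algorithms (which additionally use best response sampling and optimistic bonuses); the function name and the toy payoff matrix are illustrative.

```python
import numpy as np

def kl_regularized_best_response(payoff, opponent, pi_ref, beta):
    """Closed-form best response for the row (max) player in a zero-sum matrix game
    under reverse-KL regularization toward pi_ref:
        maximize_pi   pi^T @ payoff @ opponent - beta * KL(pi || pi_ref)
    The maximizer is proportional to pi_ref(a) * exp(q(a) / beta)."""
    q = payoff @ opponent                  # expected payoff of each row action
    logits = np.log(pi_ref) + q / beta     # reference log-prior plus scaled values
    logits -= logits.max()                 # shift for numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

# Toy example: matching pennies with a uniform reference policy (entropy regularization).
payoff = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])
pi_ref = np.array([0.5, 0.5])
opponent = np.array([0.7, 0.3])
print(kl_regularized_best_response(payoff, opponent, pi_ref, beta=1.0))
```

Intuitively, as $\beta \to \infty$ the response collapses onto the reference policy, while as $\beta \to 0$ it approaches an unregularized best response; this interpolation is consistent with a logarithmic regret term that scales inversely with $\beta$.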
Related papers
- Near Optimal Convergence to Coarse Correlated Equilibrium in General-Sum Markov Games [0.0]
No-regret learning dynamics play a central role in game theory, enabling decentralized convergence to equilibrium. We improve the convergence rate to CCE in general-sum games, reducing it from the previously best-known rate of $\mathcal{O}(\log^5 T / T)$ to a sharper $\mathcal{O}(\log T / T)$. This matches the best known convergence rate for CE in terms of $T$, the number of iterations, while also improving the dependence on the action set size.
arXiv Detail & Related papers (2025-11-04T00:54:54Z) - Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification [10.617854230082896]
Group Relative Policy Optimization was introduced and used recently for promoting reasoning in LLMs under verifiable (binary) rewards. We analyze variants that differ in reward normalization (mean-only vs. mean + variance) and in how they regularize updates using KL divergence.
arXiv Detail & Related papers (2025-03-09T14:36:45Z) - RSPO: Regularized Self-Play Alignment of Large Language Models [54.593523736962]
Regularized Self-Play Policy Optimization (RSPO) is a general and modular framework that unifies prior methods and enables plug-and-play integration of various regularizers. Our empirical study involving over $120$ fine-tuned Mistral-7B-Instruct models reveals that forward KL divergence regularization reduces response length, whereas reverse KL divergence markedly improves raw win rates.
arXiv Detail & Related papers (2025-02-24T22:43:21Z) - Logarithmic Regret for Online KL-Regularized Reinforcement Learning [51.113248212150964]
KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models. Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. We propose an optimism-based KL-regularized online contextual bandit algorithm and provide a novel analysis of its regret.
arXiv Detail & Related papers (2025-02-11T11:11:05Z) - WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP)
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
arXiv Detail & Related papers (2024-06-24T16:24:34Z) - Local and adaptive mirror descents in extensive-form games [37.04094644847904]
We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback.
We consider a fixed sampling approach, where players still update their policies over time, but with observations obtained through a given fixed sampling policy.
We show that this approach guarantees a convergence rate of $\widetilde{\mathcal{O}}(T^{-1/2})$ with high probability and has a near-optimal dependence on the game parameters.
arXiv Detail & Related papers (2023-09-01T09:20:49Z) - Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence [22.400759435696102]
We investigate a generalized KL divergence, called the Tsallis KL divergence, which uses the $q$-logarithm in its definition.
We characterize the types of policies learned under the Tsallis KL, and motivate when $q >1$ could be beneficial.
We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
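For reference, here is a minimal sketch of the $q$-logarithm and a Tsallis KL divergence built from it, under one common convention from Tsallis statistics; the exact definition used in the paper above may differ, and the distributions `pi` and `mu` are illustrative.

```python
import numpy as np

def log_q(x, q):
    """q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q), recovering ln(x) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_kl(pi, mu, q):
    """Tsallis relative entropy D_q(pi || mu) = -sum_a pi(a) * ln_q(mu(a) / pi(a)).
    Reduces to the standard KL divergence KL(pi || mu) as q -> 1."""
    return -np.sum(pi * log_q(mu / pi, q))

pi = np.array([0.7, 0.2, 0.1])
mu = np.array([1 / 3, 1 / 3, 1 / 3])
for q in (1.0, 1.5, 2.0):
    print(f"q = {q}: D_q(pi || mu) = {tsallis_kl(pi, mu, q):.4f}")
```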
arXiv Detail & Related papers (2023-01-27T00:31:51Z) - Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions [145.54544979467872]
We propose and analyze new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions.
We prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario.
Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization.
arXiv Detail & Related papers (2022-07-25T18:29:16Z) - Policy Optimization for Markov Games: Unified Framework and Faster Convergence [81.3266426402464]
We show that the state-wise average policy of this algorithm converges to an approximate Nash equilibrium (NE) of the game.
We extend this algorithm to multi-player general-sum Markov games and show a $\widetilde{\mathcal{O}}(T^{-1/2})$ convergence rate to coarse correlated equilibria (CCE).
arXiv Detail & Related papers (2022-06-06T14:23:13Z)