Nash Policy Gradient: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria
- URL: http://arxiv.org/abs/2510.18183v1
- Date: Tue, 21 Oct 2025 00:14:45 GMT
- Title: Nash Policy Gradient: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria
- Authors: Eason Yu, Tzu Hao Liu, Yunke Wang, Clément L. Canonne, Nguyen H. Tran, Chang Xu
- Abstract summary: We develop a practical algorithm for finding Nash equilibria in imperfect-information games. Nash Policy Gradient (NashPG) achieves comparable or lower exploitability than prior model-free methods on classic benchmark games.
- Score: 27.756691720415798
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Finding Nash equilibria in imperfect-information games remains a central challenge in multi-agent reinforcement learning. While regularization-based methods have recently achieved last-iteration convergence to a regularized equilibrium, they require the regularization strength to shrink toward zero to approximate a Nash equilibrium, often leading to unstable learning in practice. Instead, we fix the regularization strength at a large value for robustness and achieve convergence by iteratively refining the reference policy. Our main theoretical result shows that this procedure guarantees strictly monotonic improvement and convergence to an exact Nash equilibrium in two-player zero-sum games, without requiring a uniqueness assumption. Building on this framework, we develop a practical algorithm, Nash Policy Gradient (NashPG), which preserves the generalizability of policy gradient methods while relying solely on the current and reference policies. Empirically, NashPG achieves comparable or lower exploitability than prior model-free methods on classic benchmark games and scales to large domains such as Battleship and No-Limit Texas Hold'em, where NashPG consistently attains higher Elo ratings.
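To make the fixed-regularization idea concrete, here is a minimal numpy sketch of the scheme the abstract describes, on a toy zero-sum matrix game: both players take policy-gradient steps on a KL-regularized payoff with the regularization strength held fixed, and the reference policy is periodically reset to the current policy. The game, learning rate, and `refine_every` schedule are illustrative assumptions, not the paper's actual algorithm or hyperparameters.

```python
import numpy as np

# Payoff matrix of a toy zero-sum game (a biased rock-paper-scissors).
# The row player maximizes x^T A y; the column player minimizes it.
A = np.array([[0., -1., 2.],
              [1., 0., -1.],
              [-2., 1., 0.]])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_grad_logits(logits, ref):
    """Gradient of KL(pi || ref) with respect to the logits of pi."""
    p = softmax(logits)
    g = np.log(p) - np.log(ref)
    return p * (g - p @ g)          # chain rule through the softmax

tau = 1.0           # regularization strength: fixed at a large value
lr = 0.1
refine_every = 200  # hypothetical schedule for refreshing the reference

x_logits, y_logits = np.zeros(3), np.zeros(3)
x_ref, y_ref = softmax(x_logits), softmax(y_logits)

for t in range(1, 2001):
    x, y = softmax(x_logits), softmax(y_logits)
    # Policy-gradient step on the KL-regularized payoff.
    gx = x * (A @ y - x @ (A @ y))      # grad of x^T A y w.r.t. x's logits
    gy = y * (A.T @ x - y @ (A.T @ x))  # grad of x^T A y w.r.t. y's logits
    x_logits += lr * (gx - tau * kl_grad_logits(x_logits, x_ref))
    y_logits -= lr * (gy + tau * kl_grad_logits(y_logits, y_ref))
    # Iterative refinement: re-anchor the regularizer to the current policies.
    if t % refine_every == 0:
        x_ref, y_ref = softmax(x_logits), softmax(y_logits)

print("approximate equilibrium:", softmax(x_logits), softmax(y_logits))
```

Each refresh moves the anchor of the regularizer to a better policy, which is how convergence to an exact (rather than regularized) equilibrium can be obtained without shrinking tau.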
Related papers
- Unifying Stable Optimization and Reference Regularization in RLHF [64.16830602324345]
This paper introduces a unified regularization approach that balances the objectives of preventing reward hacking and maintaining stable policy updates. Our simple yet principled alignment objective yields a weighted supervised fine-tuning loss with a superior trade-off, demonstrably improving alignment results while reducing implementation complexity.
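Since the summary says the unified objective reduces to a weighted supervised fine-tuning loss, here is a minimal sketch of that loss shape; the log-probabilities and per-sample weights are toy values, and the paper's actual weighting scheme is not reproduced here.

```python
import numpy as np

def weighted_sft_loss(logprobs, weights):
    """Weighted supervised fine-tuning loss: a per-sample weighted
    negative log-likelihood over preferred responses."""
    return -np.mean(weights * logprobs)

# Toy sequence-level log-probabilities of preferred responses under the
# current policy, and hypothetical per-sample weights.
logprobs = np.array([-1.2, -0.4, -2.0])
weights = np.array([0.9, 1.1, 0.5])
print(weighted_sft_loss(logprobs, weights))
```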
arXiv Detail & Related papers (2026-02-12T03:31:19Z)
- Accuracy of Discretely Sampled Stochastic Policies in Continuous-time Reinforcement Learning [3.973277434105709]
We rigorously analyze a policy execution framework that samples actions from a policy at discrete time points and implements them as piecewise constant controls. We prove that as the sampling mesh size tends to zero, the controlled state process converges weakly to the dynamics with coefficients aggregated according to the policy. Building on these results, we analyze the bias and variance of various policy gradient estimators based on discrete-time observations.
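A toy illustration of the execution framework described above: the policy is sampled only at discrete grid points and the sampled action is held constant (piecewise constant control) while the state evolves between samples. The dynamics and policy below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_sample(x):
    """Toy stochastic policy: Gaussian noise around a linear feedback."""
    return -0.5 * x + rng.normal(scale=0.3)

T, mesh, dt = 1.0, 0.1, 0.001   # horizon, sampling mesh, integration step
hold = int(mesh / dt)           # number of fine steps per sampled action
x, a = 1.0, 0.0
for k in range(int(T / dt)):
    if k % hold == 0:           # sample the policy on the coarse grid only
        a = policy_sample(x)
    x += (-x + a) * dt          # toy dynamics dx = (-x + a) dt, Euler step
print("state at time T:", x)
```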
arXiv Detail & Related papers (2025-03-13T02:35:23Z)
- COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences [57.70902561665269]
We propose a meta-algorithm, Convergent Meta Alignment Algorithm (COMAL), for language model alignment with general preferences. We provide a theoretical analysis showing that our meta-algorithm converges to an exact Nash policy in the last iterate, and demonstrate its effectiveness on a range of synthetic and preference optimization datasets.
arXiv Detail & Related papers (2024-10-30T17:13:02Z)
- A Policy-Gradient Approach to Solving Imperfect-Information Games with Best-Iterate Convergence [21.195897792629548]
We show for the first time that a policy gradient method leads to provable best-iterate convergence to a regularized Nash equilibrium in self-play.
arXiv Detail & Related papers (2024-08-01T17:54:01Z)
- Learning in Zero-Sum Markov Games: Relaxing Strong Reachability and Mixing Time Assumptions [11.793922711718645]
We address payoff-based decentralized learning in infinite-horizon zero-sum Markov games. In this setting, each player makes decisions based solely on received rewards, without observing the opponent's strategy or actions and without sharing information. By establishing novel properties of the value and policy updates induced by the Tsallis entropy regularizer, we prove finite-time convergence to an approximate Nash equilibrium.
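For reference, the Tsallis entropy used as the regularizer above has the closed form sketched below (q is the entropic index; the paper's specific choice of q is not stated in this summary).

```python
import numpy as np

def tsallis_entropy(p, q=1.5):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1).
    Recovers the Shannon entropy in the limit q -> 1."""
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

p = np.array([0.7, 0.2, 0.1])
print(tsallis_entropy(p))              # entropy of a skewed policy
print(tsallis_entropy(np.ones(3) / 3)) # maximized by the uniform policy
```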
arXiv Detail & Related papers (2023-12-13T09:31:30Z)
- PAPAL: A Provable PArticle-based Primal-Dual ALgorithm for Mixed Nash Equilibrium [58.26573117273626]
We consider the problem of computing mixed Nash equilibria in two-player zero-sum continuous games with nonconvex-nonconcave objective functions.
We present novel insights into particle-based algorithms for continuous distribution strategies.
arXiv Detail & Related papers (2023-03-02T05:08:15Z)
- The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
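A toy one-state illustration of the point about aggressiveness: both score-function gradient estimates below are unbiased, but subtracting the state value baseline shrinks the magnitude of the individual update. The setup (three actions, a fixed reward vector) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One state, three actions: compare a score-function gradient estimate
# with and without the state value baseline V = E_pi[r].
logits = np.zeros(3)
rewards = np.array([1.0, 0.0, 0.5])  # toy per-action rewards

p = softmax(logits)
a = rng.choice(3, p=p)
r = rewards[a]
V = p @ rewards                       # state value baseline

score = (np.arange(3) == a) - p       # grad of log pi(a) w.r.t. the logits
grad_plain = r * score                # unbiased
grad_base = (r - V) * score           # also unbiased, but a smaller step

print(np.linalg.norm(grad_plain), np.linalg.norm(grad_base))
```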
arXiv Detail & Related papers (2023-01-16T06:28:00Z)
- On the convergence of policy gradient methods to Nash equilibria in general stochastic games [33.786186304912]
We study the long-run behavior of policy gradient methods with respect to Nash equilibrium policies.
We show that policy gradient trajectories with gradient estimates provided by the REINFORCE algorithm achieve an $\mathcal{O}(1/\sqrt{n})$ distance-squared convergence rate.
arXiv Detail & Related papers (2022-10-17T08:51:59Z)
- Approximate Nash Equilibrium Learning for n-Player Markov Games in Dynamic Pricing [0.0]
We investigate Nash equilibrium learning in a competitive Markov Game (MG) environment.
We develop a new model-free method to find approximate Nash equilibria.
We demonstrate that an approximate Nash equilibrium can be learned, particularly in the dynamic pricing domain.
arXiv Detail & Related papers (2022-07-13T19:27:07Z)
- On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We analyze the Stochastic ExtraGradient (SEG) method with constant step size and present variations of the method that yield favorable convergence.
We prove that, when augmented with iteration averaging, SEG converges to the Nash equilibrium, and that this rate is provably accelerated by incorporating a scheduled restarting procedure.
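A minimal numpy sketch of the ingredients named in the title: a stochastic extragradient step with constant step size on a bilinear game, iterate averaging, and a scheduled restart that resets the iterate to the running average. Step size, noise level, and restart period are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
B = np.array([[2.0, 1.0], [1.0, 3.0]])   # bilinear game min_x max_y x^T B y

def grads(x, y, noise=0.1):
    """Stochastic gradients of the bilinear objective."""
    gx = B @ y + noise * rng.standard_normal(2)
    gy = B.T @ x + noise * rng.standard_normal(2)
    return gx, gy

x, y = np.ones(2), np.ones(2)
eta = 0.05                                # constant step size
x_avg, y_avg, n = x.copy(), y.copy(), 1
for t in range(1, 2001):
    gx, gy = grads(x, y)
    x_half, y_half = x - eta * gx, y + eta * gy     # extrapolation step
    gx, gy = grads(x_half, y_half)
    x, y = x - eta * gx, y + eta * gy               # update step
    x_avg += (x - x_avg) / (n + 1)                  # running iterate averages
    y_avg += (y - y_avg) / (n + 1)
    n += 1
    if t % 500 == 0:          # scheduled restart: resume from the average
        x, y = x_avg.copy(), y_avg.copy()
        x_avg, y_avg, n = x.copy(), y.copy(), 1

print("averaged iterate:", x_avg, y_avg)  # the Nash equilibrium here is (0, 0)
```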
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
- Learning Nash Equilibria in Zero-Sum Stochastic Games via Entropy-Regularized Policy Approximation [18.35524179586723]
We explore the use of policy approximations to reduce the computational cost of learning Nash equilibria in zero-sum games.
We propose a new Q-learning type algorithm that uses a sequence of entropy-regularized soft policies to approximate the Nash policy.
We prove that under certain conditions, by updating the regularized Q-function, the algorithm converges to a Nash equilibrium.
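For concreteness, the entropy-regularized soft policy that such a scheme derives from a Q-function at a given state is the standard Boltzmann form sketched below; the temperature tau plays the role of the regularization strength, and the values here are illustrative.

```python
import numpy as np

def soft_policy(Q, tau=1.0):
    """Entropy-regularized (softmax) policy induced by a Q-function:
    pi(a) proportional to exp(Q(a) / tau)."""
    z = Q / tau
    e = np.exp(z - z.max())
    return e / e.sum()

Q = np.array([1.0, 0.5, -0.2])          # toy regularized Q-values at one state
for tau in (1.0, 0.1):
    print(tau, soft_policy(Q, tau))     # smaller tau -> closer to the argmax
```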
arXiv Detail & Related papers (2020-09-01T01:03:44Z)
- Zeroth-order Deterministic Policy Gradient [116.87117204825105]
We introduce Zeroth-order Deterministic Policy Gradient (ZDPG).
ZDPG approximates policy-reward gradients via two-point evaluations of the $Q$-function.
New finite sample complexity bounds for ZDPG improve upon existing results by up to two orders of magnitude.
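A minimal sketch of a two-point zeroth-order gradient estimator of the kind the summary describes, with a toy stand-in for the $Q$-function; the direction sampling and step sizes are illustrative, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def Q(theta):
    """Toy stand-in for the Q-value of a deterministic policy pi_theta:
    a concave function maximized at theta = [1, 2]."""
    return -np.sum((theta - np.array([1.0, 2.0])) ** 2)

def two_point_grad(theta, delta=1e-2):
    """Two-point zeroth-order gradient estimate along a random unit
    direction u, scaled by the dimension so that it is (approximately)
    unbiased when u is uniform on the sphere."""
    u = rng.standard_normal(theta.shape)
    u /= np.linalg.norm(u)
    d = theta.size
    return d * (Q(theta + delta * u) - Q(theta - delta * u)) / (2 * delta) * u

theta = np.zeros(2)
for _ in range(3000):
    theta += 0.01 * two_point_grad(theta)
print(theta)  # approaches the maximizer [1.0, 2.0]
```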
arXiv Detail & Related papers (2020-06-12T16:52:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.