Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees
- URL: http://arxiv.org/abs/2502.12678v1
- Date: Tue, 18 Feb 2025 09:33:48 GMT
- Title: Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees
- Authors: Yongtao Wu, Luca Viano, Yihang Chen, Zhenyu Zhu, Kimon Antonakopoulos, Quanquan Gu, Volkan Cevher,
- Abstract summary: Multi-step Preference Optimization (MPO) is built upon the natural actor-critic frameworkciteprakhlin2013online,joulani17a.
We show that OMPO requires $mathcalO(epsilon-1)$ policy updates to converge to an $epsilon$-approximate Nash equilibrium.
We also validate the effectiveness of our method on multi-turn conversations dataset and math reasoning dataset.
- Score: 91.88803125231189
- License:
- Abstract: Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the language model as a bandit problem, which limits their applicability in real-world scenarios where multi-turn conversations are common. Additionally, DPO relies on the Bradley-Terry model assumption, which does not adequately capture the non-transitive nature of human preferences. In this paper, we address these challenges by modeling the alignment problem as a two-player constant-sum Markov game, where each player seeks to maximize their winning rate against the other across all steps of the conversation. Our approach Multi-step Preference Optimization (MPO) is built upon the natural actor-critic framework~\citep{peters2008natural}. We further develop OMPO based on the optimistic online gradient descent algorithm~\citep{rakhlin2013online,joulani17a}. Theoretically, we provide a rigorous analysis for both algorithms on convergence and show that OMPO requires $\mathcal{O}(\epsilon^{-1})$ policy updates to converge to an $\epsilon$-approximate Nash equilibrium. We also validate the effectiveness of our method on multi-turn conversations dataset and math reasoning dataset.
Related papers
- Best of Both Worlds Guarantees for Smoothed Online Quadratic Optimization [9.449153668916098]
We study the smoothed online optimization (SOQO) problem where, at each round $t$, a player plays an action $x_t in response to a quadratic hitting cost and an additional squared $ell$-norm cost for switching actions.
This problem class has strong connections to a wide range of application domains including smart grid management, adaptive control, and data center management.
We present a best-of-both-worlds algorithm that obtains a robust adversarial performance while simultaneously achieving a near-optimal performance.
arXiv Detail & Related papers (2023-10-31T22:59:23Z) - Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback [75.29048190099523]
Online gradient descent (OGD) is well known to be doubly optimal under strong convexity or monotonicity assumptions.
In this paper, we design a fully adaptive OGD algorithm, textsfAdaOGD, that does not require a priori knowledge of these parameters.
arXiv Detail & Related papers (2023-10-21T18:38:13Z) - Stochastic Principal-Agent Problems: Efficient Computation and Learning [25.637633553882985]
A principal and an agent interact in a environment, each privy to observations about the state not available to the other.
The model encompasses as special cases extensive-form games (EFGs) and approaches games of Markov decision processes (POMDPs)
We show an efficient algorithm for an episodic reinforcement learning setting where transition probabilities are unknown.
arXiv Detail & Related papers (2023-06-06T16:20:44Z) - Faster Last-iterate Convergence of Policy Optimization in Zero-Sum
Markov Games [63.60117916422867]
This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
arXiv Detail & Related papers (2022-10-03T16:05:43Z) - Oracle-free Reinforcement Learning in Mean-Field Games along a Single
Sample Path [5.926203312586109]
We consider online reinforcement learning in Mean-Field Games (MFGs)
We develop an algorithm that approximates the Mean-Field Equilibrium (MFE) using the single sample path of the generic agent.
We empirically demonstrate the effectiveness of the sandbox learning algorithm in diverse scenarios.
arXiv Detail & Related papers (2022-08-24T16:22:31Z) - Provably Efficient Fictitious Play Policy Optimization for Zero-Sum
Markov Games with Structured Transitions [145.54544979467872]
We propose and analyze new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions.
We prove tight $widetildemathcalO(sqrtK)$ regret bounds after $K$ episodes in a two-agent competitive game scenario.
Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization.
arXiv Detail & Related papers (2022-07-25T18:29:16Z) - Policy Optimization for Markov Games: Unified Framework and Faster
Convergence [81.3266426402464]
We show that the state-wise average policy of this algorithm converges to an approximate Nash equilibrium (NE) of the game.
We extend this algorithm to multi-player general Markov Games and show a $mathcalwidetildeO(T-1/2)$ convergence rate to Correlated Equilibria (CCE)
arXiv Detail & Related papers (2022-06-06T14:23:13Z) - Finite-Time Complexity of Online Primal-Dual Natural Actor-Critic Algorithm for Constrained Markov Decision Processes [13.908826484332282]
We study an online primal-dual actor-critic method to solve a discounted cost constrained Markov decision process problem.
This paper is the first to study the finite-time complexity of an online primal-dual actor-critic method for solving a CMDP problem.
arXiv Detail & Related papers (2021-10-21T18:05:40Z) - Learning Zero-Sum Simultaneous-Move Markov Games Using Function
Approximation and Correlated Equilibrium [116.56359444619441]
We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games.
In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap.
In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret.
arXiv Detail & Related papers (2020-02-17T17:04:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.