Fundamental Limits of Game-Theoretic LLM Alignment: Smith Consistency and Preference Matching
- URL: http://arxiv.org/abs/2505.20627v1
- Date: Tue, 27 May 2025 02:07:35 GMT
- Title: Fundamental Limits of Game-Theoretic LLM Alignment: Smith Consistency and Preference Matching
- Authors: Zhekun Shi, Kaizhao Liu, Qi Long, Weijie J. Su, Jiancong Xiao
- Abstract summary: Nash Learning from Human Feedback is a framework for aligning large language models with human preferences by modeling learning as a zero-sum game. In this paper, we study which choices of payoff, derived from pairwise human preferences, yield desirable alignment properties.
- Score: 23.0436612817548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nash Learning from Human Feedback is a game-theoretic framework for aligning large language models (LLMs) with human preferences by modeling learning as a two-player zero-sum game. However, using the raw preference as the payoff in the game severely limits the potential of the game-theoretic LLM alignment framework. In this paper, we systematically study which choices of payoff, based on the pairwise human preferences, yield desirable alignment properties. We establish necessary and sufficient conditions for Condorcet consistency, diversity through mixed strategies, and Smith consistency. These results provide a theoretical foundation for the robustness of game-theoretic LLM alignment. Further, we show the impossibility of preference matching: no smooth and learnable mapping of pairwise preferences can guarantee a unique Nash equilibrium that matches a target policy, even under standard assumptions like the Bradley-Terry-Luce model. This result highlights a fundamental limitation of game-theoretic LLM alignment.
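For concreteness, the game described in the abstract can be written down as follows. This is a sketch of the standard NLHF setup; the payoff mapping $\phi$ and the notation are illustrative choices made here, not taken verbatim from the paper.

```latex
% Two-player zero-sum game over response policies \pi, \pi' for a prompt x.
% P(y \succ y' \mid x) is the pairwise human preference probability, and
% \phi is a payoff mapping applied to it; \phi = identity recovers the raw
% preference payoff that the abstract argues is limiting.
\[
\pi^{\star} \in \arg\max_{\pi}\,\min_{\pi'}\;
\mathbb{E}_{y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\bigl[\phi\bigl(P(y \succ y' \mid x)\bigr)\bigr].
\]
% Condorcet consistency: if one response beats every other with preference
% probability > 1/2, the Nash equilibrium should put all mass on it.
% Smith consistency generalizes this to the Smith set, the smallest
% nonempty set whose members each majority-beat every response outside it.
```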
Related papers
- Multiplayer Nash Preference Optimization [79.15013211640566]
Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. Recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). We introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime.
arXiv Detail & Related papers (2025-09-27T04:18:33Z) - Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium [23.0436612817548]
We show that Condorcet cycles exist with probability converging to one exponentially fast under a probabilistic preference model. We identify a necessary and sufficient condition for mixed strategies: the absence of a response that is preferred over all others by a majority (see the illustrative sketch after this list). We leverage insights from our statistical results to design a novel, computationally efficient algorithm for finding Nash equilibria in aligning LLMs with NLHF.
arXiv Detail & Related papers (2025-03-14T01:29:21Z) - Improving LLM General Preference Alignment via Optimistic Online Mirror Descent [57.622821649679786]
Reinforcement learning from human feedback (RLHF) has demonstrated remarkable effectiveness in aligning large language models (LLMs) with human preferences. In this paper, we drop the Bradley-Terry (BT) model assumption and study LLM alignment under general preferences, formulated as a two-player game. We show that our approach achieves an $O(T^{-1})$ bound on the duality gap, improving upon the previous $O(T^{-1/2})$ result.
arXiv Detail & Related papers (2025-02-24T05:24:52Z) - Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z) - Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment [29.197712664347794]
We introduce Magnetic Preference Optimization (MPO), a novel approach capable of achieving last-iterate convergence to the Nash equilibrium of the original game. To ensure our algorithm is both theoretically sound and practically viable, we present a simple yet effective implementation.
arXiv Detail & Related papers (2024-10-22T05:51:34Z) - Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning [55.65738319966385]
We propose a novel online algorithm, iterative Nash policy optimization (INPO). Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard.
arXiv Detail & Related papers (2024-06-30T08:00:34Z) - Large Language Models Playing Mixed Strategy Nash Equilibrium Games [1.060608983034705]
This paper examines the ability of Large Language Models to find Nash equilibria in games that have a mixed-strategy Nash equilibrium but no pure-strategy one.
The study reveals a significant enhancement in the performance of LLMs when they are able to run code.
While LLMs exhibit remarkable proficiency in well-known standard games, their performance degrades when faced with slight modifications of the same games.
arXiv Detail & Related papers (2024-06-15T09:30:20Z) - Aligners: Decoupling LLMs and Alignment [47.00002038331952]
Large Language Models (LLMs) need to be aligned with human expectations to ensure their safety and utility in most applications.
We propose to decouple LLMs and alignment by training aligner models that can be used to align any LLM for given criteria on an as-needed basis.
arXiv Detail & Related papers (2024-03-07T04:54:56Z) - GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z) - A Minimaximalist Approach to Reinforcement Learning from Human Feedback [49.45285664482369]
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback.
Our approach is minimalist in that it requires neither training a reward model nor unstable adversarial training.
We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches.
arXiv Detail & Related papers (2024-01-08T17:55:02Z) - Nash Learning from Human Feedback [86.09617990412941]
We introduce an alternative pipeline for the fine-tuning of large language models using pairwise human feedback.
We term this approach Nash learning from human feedback (NLHF).
We present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent.
arXiv Detail & Related papers (2023-12-01T19:26:23Z)
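Several of the papers above turn on mixed-strategy equilibria of preference games. The following Python sketch (illustrative only, not code from any of these papers) computes a symmetric mixed-strategy Nash equilibrium of the two-player zero-sum preference game by linear programming, assuming a finite set of candidate responses and a known pairwise preference matrix `P`; the function and variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def solve_preference_game(P):
    """Mixed-strategy Nash equilibrium of the zero-sum game with payoff
    A = P - P.T, where P[i, j] = Pr(response i is preferred to response j).

    Solves max_pi min_j (A.T @ pi)_j as a linear program over (pi, v).
    """
    n = P.shape[0]
    A = P - P.T                      # antisymmetric payoff matrix
    c = np.zeros(n + 1)
    c[-1] = -1.0                     # linprog minimizes, so minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])   # v - (A.T @ pi)_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, n + 1))
    A_eq[0, :n] = 1.0                # policy weights sum to one
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]   # pi >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    assert res.success, res.message
    return res.x[:n], res.x[-1]      # equilibrium policy and game value

# A Condorcet cycle over three responses: a beats b, b beats c, c beats a,
# each with preference probability 0.9, so no pure-strategy equilibrium exists.
P = np.array([[0.5, 0.9, 0.1],
              [0.1, 0.5, 0.9],
              [0.9, 0.1, 0.5]])
pi, value = solve_preference_game(P)
print(pi, value)                     # approximately [1/3, 1/3, 1/3] and 0.0
```

By antisymmetry of the payoff, the game value at the symmetric equilibrium is zero; in the cyclic example the equilibrium mixes uniformly over the three responses, which is exactly the diversity-through-mixed-strategies behavior the main paper characterizes.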