Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies
- URL: http://arxiv.org/abs/2410.03968v1
- Date: Fri, 4 Oct 2024 23:18:27 GMT
- Title: Decoding Game: On Minimax Optimality of Heuristic Text Generation Strategies
- Authors: Sijin Chen, Omar Hagrass, Jason M. Klusowski
- Abstract summary: We propose Decoding Game, a comprehensive theoretical framework which reimagines text generation as a two-player zero-sum game between Strategist and Nature.
It is shown that the adversarial Nature imposes an implicit regularization on likelihood maximization, and truncation-normalization methods are first-order approximations to the optimal strategy under this regularization.
- Score: 7.641996822987559
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decoding strategies play a pivotal role in text generation for modern language models, yet a puzzling gap divides theory and practice. Surprisingly, strategies that should intuitively be optimal, such as Maximum a Posteriori (MAP), often perform poorly in practice. Meanwhile, popular heuristic approaches like Top-$k$ and Nucleus sampling, which employ truncation and normalization of the conditional next-token probabilities, have achieved great empirical success but lack theoretical justifications. In this paper, we propose Decoding Game, a comprehensive theoretical framework which reimagines text generation as a two-player zero-sum game between Strategist, who seeks to produce text credible in the true distribution, and Nature, who distorts the true distribution adversarially. After discussing the decomposability of multi-step generation, we derive the optimal strategy in closed form for one-step Decoding Game. It is shown that the adversarial Nature imposes an implicit regularization on likelihood maximization, and truncation-normalization methods are first-order approximations to the optimal strategy under this regularization. Additionally, by generalizing the objective and parameters of Decoding Game, near-optimal strategies encompass diverse methods such as greedy search, temperature scaling, and hybrids thereof. Numerical experiments are conducted to complement our theoretical analysis.
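For concreteness, the sketch below is a minimal NumPy illustration (not taken from the paper) of the truncation-normalization heuristics and temperature scaling that the abstract discusses; the function names and the toy probability vector are assumptions standing in for a language model's next-token distribution.

```python
# Illustrative sketch of the truncation-normalization heuristics discussed above;
# a toy next-token distribution stands in for a language model's softmax output.
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k):
    """Top-k sampling: keep the k most likely tokens, renormalize, sample."""
    keep = np.argsort(probs)[-k:]                # indices of the k largest probabilities
    truncated = np.zeros_like(probs)
    truncated[keep] = probs[keep]
    truncated /= truncated.sum()                 # renormalize over the kept tokens
    return rng.choice(len(probs), p=truncated)

def nucleus_sample(probs, p):
    """Nucleus (Top-p) sampling: keep the smallest set of tokens whose mass is >= p."""
    order = np.argsort(probs)[::-1]              # tokens sorted by decreasing probability
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]
    truncated = np.zeros_like(probs)
    truncated[keep] = probs[keep]
    truncated /= truncated.sum()
    return rng.choice(len(probs), p=truncated)

def temperature_scale(probs, tau):
    """Temperature scaling: sharpen (tau < 1) or flatten (tau > 1) the distribution."""
    scaled = probs ** (1.0 / tau)
    return scaled / scaled.sum()

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])    # toy conditional next-token probabilities
print(top_k_sample(probs, k=3), nucleus_sample(probs, p=0.9))
print(temperature_scale(probs, tau=0.7))
```

In the paper's framing, such truncation-renormalization steps arise as first-order approximations to the Strategist's optimal play against an adversarial Nature.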
Related papers
- e-COP : Episodic Constrained Optimization of Policies [12.854752753529151]
We present the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings.
We show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting.
arXiv Detail & Related papers (2024-06-13T20:12:09Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
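The sketch below illustrates one plausible form of such a combined objective: a DPO-style preference loss plus a supervised fine-tuning term. It is an illustration only, not code from the cited paper; the tensor names, beta, and the mixing weight lam are assumptions.

```python
# Illustrative only: a "preference loss + supervised (SFT) loss" objective.
# Tensor names, beta, and lam are assumptions, not taken from the cited paper.
import torch
import torch.nn.functional as F

def preference_plus_sft_loss(policy_chosen_logps, policy_rejected_logps,
                             ref_chosen_logps, ref_rejected_logps,
                             sft_nll, beta=0.1, lam=1.0):
    # DPO-style preference term: log-sigmoid of the scaled log-probability
    # margin between chosen and rejected responses, relative to a reference model.
    margin = (policy_chosen_logps - policy_rejected_logps) - \
             (ref_chosen_logps - ref_rejected_logps)
    pref_loss = -F.logsigmoid(beta * margin).mean()
    # Supervised term: negative log-likelihood of the preferred responses,
    # acting as the regularizer described in the summary above.
    return pref_loss + lam * sft_nll.mean()
```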
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF).
We first identify the primary challenge of existing popular methods, such as offline PPO and offline DPO, as a lack of strategic exploration of the environment.
We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z) - Stackelberg Batch Policy Learning [3.5426153040167754]
Batch reinforcement learning (RL) defines the task of learning from a fixed batch of data lacking exhaustive exploration.
Worst-case optimality algorithms, which calibrate a value-function model class from logged experience, have emerged as a promising paradigm for batch RL.
We propose a novel gradient-based learning algorithm: StackelbergLearner, in which the leader player updates according to the total derivative of its objective instead of the usual individual gradient.
arXiv Detail & Related papers (2023-09-28T06:18:34Z) - Sampling-based Reactive Synthesis for Nondeterministic Hybrid Systems [20.0212772540119]
This paper introduces a sampling-based strategy synthesis algorithm for nondeterministic hybrid systems.
We model the evolution of the hybrid system as a two-player game, where the nondeterminism is an adversarial player.
The aim is to synthesize a winning strategy -- a reactive (robust) strategy that guarantees the satisfaction of the goals under all possible moves of the adversarial player.
arXiv Detail & Related papers (2023-04-14T00:45:16Z) - Fictitious Play with Maximin Initialization [1.7132914341329848]
Fictitious play is the most accurate scalable algorithm for Nash equilibrium strategies in multiplayer games.
We show that the equilibrium approximation error of fictitious play can be significantly reduced by carefully selecting the initial strategies.
arXiv Detail & Related papers (2022-03-21T07:34:20Z) - A Simple yet Universal Strategy for Online Convex Optimization [97.64589722655612]
We propose a simple strategy for universal online convex optimization.
The key idea is to construct a set of experts to process the original online functions, and deploy a meta-algorithm over the linearized losses.
Our strategy inherits the theoretical guarantee of any expert designed for strongly convex functions and exponentially concave functions.
arXiv Detail & Related papers (2021-05-08T11:43:49Z) - Portfolio Search and Optimization for General Strategy Game-Playing [58.896302717975445]
We propose a new algorithm for optimization and action-selection based on the Rolling Horizon Evolutionary Algorithm.
For the optimization of the agents' parameters and portfolio sets we study the use of the N-tuple Bandit Evolutionary Algorithm.
An analysis of the agents' performance shows that the proposed algorithm generalizes well to all game-modes and is able to outperform other portfolio methods.
arXiv Detail & Related papers (2021-04-21T09:28:28Z) - On the Impossibility of Convergence of Mixed Strategies with No Regret Learning [10.515544361834241]
We study convergence properties of the mixed strategies that result from a general class of optimal no regret learning strategies.
We consider the class of strategies whose information set at each step is the empirical average of the opponent's realized play.
arXiv Detail & Related papers (2020-12-03T18:02:40Z) - Faster Algorithms for Optimal Ex-Ante Coordinated Collusive Strategies in Extensive-Form Zero-Sum Games [123.76716667704625]
We focus on the problem of finding an optimal strategy for a team of two players that faces an opponent in an imperfect-information zero-sum extensive-form game.
In that setting, it is known that the best the team can do is sample a profile of potentially randomized strategies (one per player) from a joint (a.k.a. correlated) probability distribution at the beginning of the game.
We provide an algorithm that computes such an optimal distribution by only using profiles where only one of the team members gets to randomize in each profile.
arXiv Detail & Related papers (2020-09-21T17:51:57Z) - Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles [112.89548995091182]
We provide the first universal and optimal reduction from contextual bandits to online regression.
Our algorithm requires no distributional assumptions beyond realizability, and works even when contexts are chosen adversarially.
arXiv Detail & Related papers (2020-02-12T11:33:46Z)