Your Self-Play Algorithm is Secretly an Adversarial Imitator: Understanding LLM Self-Play through the Lens of Imitation Learning
- URL: http://arxiv.org/abs/2602.01357v1
- Date: Sun, 01 Feb 2026 17:50:59 GMT
- Title: Your Self-Play Algorithm is Secretly an Adversarial Imitator: Understanding LLM Self-Play through the Lens of Imitation Learning
- Authors: Shangzhe Li, Xuchao Zhang, Chetan Bansal, Weitong Zhang
- Abstract summary: We propose a new self-play imitation finetuning algorithm based on the $\chi^2$-divergence variational objective with bounded rewards and improved stability. Experiments on a variety of language model finetuning tasks demonstrate consistent improvements over existing self-play methods.
- Score: 29.069448000317035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-play post-training methods have emerged as an effective approach for finetuning large language models, turning a weak language model into a strong one without preference data. However, the theoretical foundations of self-play finetuning remain underexplored. In this work, we address this gap by connecting self-play finetuning with adversarial imitation learning, formulating the finetuning procedure as a min-max game between the model and a regularized implicit reward player parameterized by the model itself. This perspective unifies self-play imitation and general preference alignment within a common framework. Under this formulation, we present a game-theoretic analysis showing that self-play finetuning converges to its equilibrium. Guided by this theoretical formulation, we propose a new self-play imitation finetuning algorithm based on the $\chi^2$-divergence variational objective with bounded rewards and improved stability. Experiments on a variety of language model finetuning tasks demonstrate consistent improvements over existing self-play methods and validate our theoretical insights.
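For orientation, the adversarial-imitation view described in the abstract can be written as a min-max objective; the sketch below uses one standard variational form of the Pearson $\chi^2$-divergence (our notation and regularization, not necessarily the paper's exact objective).

```latex
% Variational form of the Pearson chi^2-divergence between the data
% distribution pi_data and the model pi_theta, with an implicit reward r:
%   chi^2(pi_data || pi_theta) = sup_r  E_{pi_data}[r] - E_{pi_theta}[r + r^2/4],
% which turns imitation finetuning into the min-max game
\min_{\theta} \; \max_{r} \;\;
  \mathbb{E}_{y \sim \pi_{\mathrm{data}}(\cdot \mid x)}\big[\, r(x, y) \,\big]
  \;-\; \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}\Big[\, r(x, y) + \tfrac{1}{4}\, r(x, y)^{2} \,\Big]
```

The quadratic penalty on the model's own samples keeps the implicit reward from growing without bound, which is consistent with the bounded-reward and stability properties emphasized in the abstract.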
Related papers
- Self-Generative Adversarial Fine-Tuning for Large Language Models [34.82368594497859]
Fine-tuning large language models (LLMs) for alignment typically relies on supervised fine-tuning or reinforcement learning from human feedback.
Recent self-play and synthetic data approaches reduce this dependence but often rely on assumptions or ungrounded self-evaluation.
We propose Self-Generative Adversarial LLM (SGALM), a unified fine-tuning framework that formulates alignment as a generative adversarial game.
arXiv Detail & Related papers (2026-02-01T10:20:27Z)
- Towards Understanding Self-play for LLM Reasoning [3.058685580689604]
We analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner.
Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions.
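Since the entry above tracks entropy dynamics of token distributions during self-play training, here is a minimal sketch of such a diagnostic (our own illustration, not code from the paper):

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Average per-token entropy of a language model's output distribution.

    logits: tensor of shape (batch, seq_len, vocab_size).
    Tracking this scalar over training steps is one way to monitor whether
    self-play is collapsing the policy's token distribution.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)  # shape (batch, seq_len)
    return entropy.mean()

# Example with random logits standing in for real model outputs.
logits = torch.randn(2, 8, 32000)
print(mean_token_entropy(logits).item())
```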
arXiv Detail & Related papers (2025-10-31T00:41:37Z)
- Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving [55.13109926181247]
We introduce ReflectDrive, a learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion.
Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation.
Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors.
arXiv Detail & Related papers (2025-09-24T13:35:15Z)
- Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition [89.50068130832635]
Self-Improving cognition (SIcog) is a self-learning framework for constructing next-generation foundation MLLMs through multimodal knowledge.
We propose Chain-of-Description for step-by-step visual understanding and integrate structured Chain-of-Thought (CoT) reasoning to support in-depth multimodal reasoning.
Experiments demonstrate SIcog's effectiveness in developing MLLMs with enhanced multimodal cognition.
arXiv Detail & Related papers (2025-03-16T00:25:13Z)
- Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning [65.85335291827086]
This paper aims to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints.
We pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos.
For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model.
arXiv Detail & Related papers (2025-03-11T13:50:22Z)
- Self-rewarding correction for mathematical reasoning [19.480508580498103]
We study self-rewarding reasoning large language models (LLMs).
LLMs can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs during inference, without external feedback.
We propose a two-stage algorithmic framework for constructing self-rewarding reasoning models using only self-generated data.
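As a rough illustration of the generate-then-self-evaluate loop described above, the sketch below uses hypothetical `generate` and `self_evaluate` callables as placeholders for the model; it is not the paper's algorithm.

```python
from typing import Callable

def self_rewarding_answer(
    prompt: str,
    generate: Callable[[str], str],              # hypothetical: LLM call returning reasoning + answer
    self_evaluate: Callable[[str, str], float],  # hypothetical: the same LLM scoring its output in [0, 1]
    max_attempts: int = 3,
    threshold: float = 0.5,
) -> str:
    """Generate an answer, score it with the model itself, and retry with a correction prompt on low scores."""
    best_answer, best_score = "", float("-inf")
    for _ in range(max_attempts):
        answer = generate(prompt)
        score = self_evaluate(prompt, answer)
        if score > best_score:
            best_answer, best_score = answer, score
        if score >= threshold:  # accept once the self-assigned reward is high enough
            break
        prompt = f"{prompt}\n\nPrevious attempt (self-score {score:.2f}):\n{answer}\nPlease correct it."
    return best_answer

# Toy usage with stub callables standing in for the model.
print(self_rewarding_answer("2+2=?", lambda p: "4", lambda p, a: 1.0))
```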
arXiv Detail & Related papers (2025-02-26T23:01:16Z)
- Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening.
Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training.
We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
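One concrete reading of the SFT-style "sharpening" family mentioned above is best-of-n self-verification; the sketch below is our own illustration with hypothetical `sample` and `verify` callables, not the paper's implementation.

```python
from typing import Callable, List, Tuple

def sharpen_sft_example(
    prompt: str,
    sample: Callable[[str], str],          # hypothetical: draws one response from the current model
    verify: Callable[[str, str], float],   # hypothetical: the same model scoring (prompt, response)
    n_samples: int = 8,
) -> Tuple[str, str]:
    """Best-of-n self-verification: keep the response the model itself rates highest.

    Finetuning on (prompt, best_response) pairs collected this way concentrates the
    policy on responses its own verifier prefers.
    """
    candidates: List[str] = [sample(prompt) for _ in range(n_samples)]
    best = max(candidates, key=lambda resp: verify(prompt, resp))
    return prompt, best

# Toy usage with stubs in place of a real model.
print(sharpen_sft_example("Prove 1+1=2.", lambda p: "a short proof", lambda p, r: len(r)))
```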
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
- Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation [71.91287418249688]
Large language models (LLMs) often struggle with factual inaccuracies, even when they hold relevant knowledge.
We leverage the self-evaluation capability of an LLM to provide training signals that steer the model towards factuality.
We show that the proposed self-alignment approach substantially enhances factual accuracy over Llama family models across three key knowledge-intensive tasks.
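A minimal sketch of how self-evaluation scores could be turned into training signals of the kind described above; the pairing scheme and names here are assumptions for illustration, not the paper's method.

```python
from typing import Callable, List, Tuple

def build_preference_pairs(
    prompt: str,
    responses: List[str],
    self_evaluate: Callable[[str, str], float],  # hypothetical: LLM scoring its own response for factuality
) -> List[Tuple[str, str, str]]:
    """Rank self-generated responses by the model's own factuality score and pair the
    best against the worst, yielding (prompt, chosen, rejected) triples that a
    preference-optimization step (e.g. DPO) could consume."""
    if len(responses) < 2:
        return []
    ranked = sorted(responses, key=lambda r: self_evaluate(prompt, r), reverse=True)
    return [(prompt, ranked[0], ranked[-1])]

# Toy usage with a stub scorer.
pairs = build_preference_pairs("Who wrote Hamlet?", ["Shakespeare", "Marlowe"],
                               lambda p, r: float(r == "Shakespeare"))
print(pairs)
```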
arXiv Detail & Related papers (2024-02-14T15:52:42Z)
- Learning a Diffusion Model Policy from Rewards via Q-Score Matching [93.0191910132874]
We present a theoretical framework linking the structure of diffusion model policies to a learned Q-function.
We propose a new policy update method from this theory, which we denote Q-score matching.
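For orientation, Q-score matching aligns the policy's score function with the action-gradient of the learned Q-function; one way to write such an objective (our notation, a sketch rather than the paper's exact loss) is:

```latex
% Match the score of the diffusion policy pi_theta to the action-gradient of a
% learned Q-function; beta > 0 is a temperature-like scale.
\mathcal{L}(\theta) \;=\;
  \mathbb{E}_{s,\; a \sim \pi_{\theta}(\cdot \mid s)}
  \Big[\, \big\| \nabla_{a} \log \pi_{\theta}(a \mid s) \;-\; \beta\, \nabla_{a} Q(s, a) \big\|_{2}^{2} \,\Big]
```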
arXiv Detail & Related papers (2023-12-18T23:31:01Z)
- Control as Hybrid Inference [62.997667081978825]
We present an implementation of CHI which naturally mediates the balance between iterative and amortised inference.
We verify the scalability of our algorithm on a continuous control benchmark, demonstrating that it outperforms strong model-free and model-based baselines.
arXiv Detail & Related papers (2020-07-11T19:44:09Z)
- A Comparison of Self-Play Algorithms Under a Generalized Framework [4.339542790745868]
The notion of self-play, albeit often cited in multiagent Reinforcement Learning, has never been grounded in a formal model.
We present a formalized framework, with clearly defined assumptions, which encapsulates the meaning of self-play.
We measure how well a subset of the captured self-play methods approximate this solution when paired with the famous PPO algorithm.
arXiv Detail & Related papers (2020-06-08T11:02:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.