Towards Understanding Self-play for LLM Reasoning
- URL: http://arxiv.org/abs/2510.27072v1
- Date: Fri, 31 Oct 2025 00:41:37 GMT
- Title: Towards Understanding Self-play for LLM Reasoning
- Authors: Justin Yang Chae, Md Tanvirul Alam, Nidhi Rastogi
- Abstract summary: We analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner. Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions.
- Score: 3.058685580689604
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future directions for improving LLM math reasoning through self-play.
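The abstract refers to two standard quantities: the entropy of the policy's token distributions and pass@k evaluation. As a point of reference, here is a minimal sketch (not code from the paper) of how these are commonly computed; the function names and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k completions drawn without replacement from n sampled
    completions is correct, given that c of the n are correct."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of a single next-token distribution,
    computed from unnormalized logits via a numerically stable softmax."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())
```

For example, pass_at_k(10, 3, 5) evaluates to about 0.917, matching the combinatorial form 1 - C(7,5)/C(10,5).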
Related papers
- Reasoning Promotes Robustness in Theory of Mind Tasks [0.26945563448932225]
Large language models (LLMs) have recently shown strong performance on Theory of Mind (ToM) tests. This paper examines the behavior of such reasoning models in ToM tasks using novel adaptations of machine psychological experiments and results from established benchmarks.
arXiv Detail & Related papers (2026-01-23T16:01:24Z)
- Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing [70.35701681177655]
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models. We introduce four efficient strategies to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
arXiv Detail & Related papers (2025-10-30T13:26:58Z)
- Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap [11.709132975874638]
We theoretically model the training dynamics of self-improvement via the concept of the solver-verifier gap. We extend our analysis to investigate how external data influences these dynamics within the framework.
arXiv Detail & Related papers (2025-06-29T06:48:47Z)
- A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models [103.88578274567784]
Motivation-enhanced Reinforcement Finetuning (MeRF) is an intuitive yet effective method for enhancing reinforcement finetuning of Large Reasoning Models. MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to be aware of the optimization objective. MeRF achieves substantial performance gains over the RLVR baseline.
arXiv Detail & Related papers (2025-06-23T10:37:57Z)
- No Free Lunch: Rethinking Internal Feedback for LLM Reasoning [12.881043910316787]
Reinforcement learning has emerged as a powerful paradigm for post-training large language models (LLMs) to improve reasoning. We investigate an alternative class of methods, Reinforcement Learning from Internal Feedback (RLIF), which relies solely on intrinsic model-derived signals instead of external rewards.
arXiv Detail & Related papers (2025-06-20T17:59:52Z)
- Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning [87.7836502955847]
We propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood. We introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy.
arXiv Detail & Related papers (2025-06-10T12:40:39Z)
- Can Large Reasoning Models Self-Train? [51.0277533541394]
We use majority voting as a simple self-feedback mechanism to study whether self-training can be sustained within reinforcement learning. We find that this basic approach improves not only the model's reasoning performance, but also its capability to generate better-quality feedback for the next RL iteration. Yet our analysis also reveals a critical limitation of such a self-training paradigm: prolonged RL with self-reward leads to reward hacking, resulting in sudden and complete performance collapse. (A minimal illustrative sketch of majority-vote self-reward appears after this list.)
arXiv Detail & Related papers (2025-05-27T17:16:00Z)
- Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this issue.
arXiv Detail & Related papers (2025-05-19T17:59:31Z)
- Superficial Self-Improved Reasoners Benefit from Model Merging [49.09091498084467]
Self-improvement has been proposed as a solution for synthesizing a high-quality data corpus. In particular, our analysis reveals that even when LMs show improved in-domain (ID) reasoning accuracy, they actually compromise their generalized reasoning capabilities. We propose Iterative Model Merging (IMM), a method that strategically combines weights from original and self-improved models to preserve generalization.
arXiv Detail & Related papers (2025-03-03T22:41:25Z)
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models [10.449015816015566]
Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training, and test-time inference. We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as the generation-verification gap. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance.
arXiv Detail & Related papers (2024-12-03T18:47:26Z)
- SELF: Self-Evolution with Language Feedback [68.6673019284853]
'SELF' (Self-Evolution with Language Feedback) is a novel approach to advance large language models.
It enables LLMs to self-improve through self-reflection, akin to human learning processes.
Our experiments in mathematics and general tasks demonstrate that SELF can enhance the capabilities of LLMs without human intervention.
arXiv Detail & Related papers (2023-10-01T00:52:24Z)
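The "Can Large Reasoning Models Self-Train?" entry above describes majority voting as a self-feedback signal. Below is a minimal, hedged sketch of that idea under assumed interfaces: the function name and the task-specific `extract_answer` callback are illustrative placeholders, not code from any of the listed papers.

```python
# Hypothetical sketch of majority-vote self-reward: each sampled completion
# for a prompt is rewarded by whether its final answer agrees with the most
# common answer across all completions for that prompt.
from collections import Counter
from typing import Callable, List

def majority_vote_rewards(
    completions: List[str],
    extract_answer: Callable[[str], str],
) -> List[float]:
    """Assign reward 1.0 to completions whose extracted answer matches the
    majority answer among all samples for the same prompt, else 0.0."""
    answers = [extract_answer(c) for c in completions]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]
```

Because the reward is derived from agreement rather than ground truth, prolonged training against it can drift toward reward hacking, consistent with the collapse that entry reports.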