Towards Understanding Self-play for LLM Reasoning
- URL: http://arxiv.org/abs/2510.27072v1
- Date: Fri, 31 Oct 2025 00:41:37 GMT
- Title: Towards Understanding Self-play for LLM Reasoning
- Authors: Justin Yang Chae, Md Tanvirul Alam, Nidhi Rastogi
- Abstract summary: We analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner. Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions.
- Score: 3.058685580689604
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future directions for improving LLM math reasoning through self-play.
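The abstract refers to two standard quantities: the entropy of the policy's token distributions and pass@k evaluation. As a point of reference, here is a minimal sketch (not code from the paper) of how these are commonly computed; the function names and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k completions drawn without replacement from n sampled
    completions is correct, given that c of the n are correct."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of a single next-token distribution,
    computed from unnormalized logits via a numerically stable softmax."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(-(probs * np.log(probs + 1e-12)).sum())
```

For example, pass_at_k(10, 3, 5) evaluates to about 0.917, matching the combinatorial form 1 - C(7,5)/C(10,5).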
Related papers
- Reasoning Promotes Robustness in Theory of Mind Tasks [0.26945563448932225]
Large language models (LLMs) have recently shown strong performance on Theory of Mind (ToM) tests. This paper examines the behavior of such reasoning models in ToM tasks using novel adaptations of machine psychological experiments and results from established benchmarks.
arXiv Detail & Related papers (2026-01-23T16:01:24Z)
- Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing [70.35701681177655]
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models. We introduce four efficient strategies to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
arXiv Detail & Related papers (2025-10-30T13:26:58Z)
- Theoretical Modeling of LLM Self-Improvement Training Dynamics Through Solver-Verifier Gap [11.709132975874638]
We theoretically model the training dynamics of self-improvement via the concept of the solver-verifier gap. We extend our analysis to investigate how external data influences these dynamics within the framework.
arXiv Detail & Related papers (2025-06-29T06:48:47Z)
- A Simple "Motivation" Can Enhance Reinforcement Finetuning of Large Reasoning Models [103.88578274567784]
Motivation-enhanced Reinforcement Finetuning (MeRF) is an intuitive yet effective method for enhancing reinforcement finetuning of Large Reasoning Models. MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for the model to be aware of the optimization objective. MeRF achieves substantial performance gains over the RLVR baseline.
arXiv Detail & Related papers (2025-06-23T10:37:57Z)
- No Free Lunch: Rethinking Internal Feedback for LLM Reasoning [12.881043910316787]
Reinforcement learning has emerged as a powerful paradigm for post-training large language models (LLMs) to improve reasoning. We investigate an alternative class of methods, Reinforcement Learning from Internal Feedback (RLIF), which relies solely on intrinsic model-derived signals instead of external rewards.
arXiv Detail & Related papers (2025-06-20T17:59:52Z)
- Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning [87.7836502955847]
We propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood. We introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy.
arXiv Detail & Related papers (2025-06-10T12:40:39Z)
- Can Large Reasoning Models Self-Train? [51.0277533541394]
We use majority voting as a simple self-feedback mechanism to study whether self-training can be sustained within reinforcement learning. We find that this basic approach improves not only the model's reasoning performance, but also its capability to generate better-quality feedback for the next RL iteration. Yet our analysis also reveals a critical limitation of such a self-training paradigm: prolonged RL with self-reward leads to reward hacking, resulting in sudden and complete performance collapse. (A minimal illustrative sketch of majority-vote self-reward appears after this list.)
arXiv Detail & Related papers (2025-05-27T17:16:00Z)
- Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this issue.
arXiv Detail & Related papers (2025-05-19T17:59:31Z)
- Superficial Self-Improved Reasoners Benefit from Model Merging [49.09091498084467]
Self-improvement has been proposed as a solution for synthesizing a high-quality data corpus. In particular, our analysis reveals that even when LMs show improved in-domain (ID) reasoning accuracy, they actually compromise their generalized reasoning capabilities. We propose Iterative Model Merging (IMM), a method that strategically combines weights from original and self-improved models to preserve generalization.
arXiv Detail & Related papers (2025-03-03T22:41:25Z)
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models [10.449015816015566]
Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training, and test-time inference. We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as the generation-verification gap. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance.
arXiv Detail & Related papers (2024-12-03T18:47:26Z)
- SELF: Self-Evolution with Language Feedback [68.6673019284853]
'SELF' (Self-Evolution with Language Feedback) is a novel approach to advance large language models.
It enables LLMs to self-improve through self-reflection, akin to human learning processes.
Our experiments in mathematics and general tasks demonstrate that SELF can enhance the capabilities of LLMs without human intervention.
arXiv Detail & Related papers (2023-10-01T00:52:24Z)
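The "Can Large Reasoning Models Self-Train?" entry above describes majority voting as a self-feedback signal. Below is a minimal, hedged sketch of that idea under assumed interfaces: the function name and the task-specific `extract_answer` callback are illustrative placeholders, not code from any of the listed papers.

```python
# Hypothetical sketch of majority-vote self-reward: each sampled completion
# for a prompt is rewarded by whether its final answer agrees with the most
# common answer across all completions for that prompt.
from collections import Counter
from typing import Callable, List

def majority_vote_rewards(
    completions: List[str],
    extract_answer: Callable[[str], str],
) -> List[float]:
    """Assign reward 1.0 to completions whose extracted answer matches the
    majority answer among all samples for the same prompt, else 0.0."""
    answers = [extract_answer(c) for c in completions]
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]
```

Because the reward is derived from agreement rather than ground truth, prolonged training against it can drift toward reward hacking, consistent with the collapse that entry reports.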