Can Large Reasoning Models Self-Train?
- URL: http://arxiv.org/abs/2505.21444v2
- Date: Wed, 08 Oct 2025 22:32:56 GMT
- Title: Can Large Reasoning Models Self-Train?
- Authors: Sheikh Shafayat, Fahim Tajwar, Ruslan Salakhutdinov, Jeff Schneider, Andrea Zanette
- Abstract summary: We use majority voting as a simple self-feedback mechanism to study whether self-training can be sustained within reinforcement learning. We find that this basic approach improves not only the model's reasoning performance, but also its capability of generating better-quality feedback for the next RL iteration. Yet our analysis also reveals a critical limitation of such a self-training paradigm: prolonged RL with self-reward leads to reward hacking, resulting in sudden and complete performance collapse.
- Score: 51.0277533541394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent successes of reinforcement learning (RL) in training large reasoning models motivate the question of whether self-training - the process where a model learns from its own judgments - can be sustained within RL. In this work, we study this question using majority voting as a simple self-feedback mechanism. In a comprehensive set of experiments on both synthetic and real reasoning tasks, we find that this basic approach improves not only the model's reasoning performance, but also its capability of generating better-quality feedback for the next RL iteration, driving further model improvement. Yet our analysis also reveals a critical limitation of such a self-training paradigm: prolonged RL with self-reward leads to reward hacking, where models learn to maximize the training (pseudo-)reward, resulting in sudden and complete performance collapse. Together, these results highlight feedback design as the central challenge and call for future research on mechanisms that enable prolonged self-improvement.
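The core mechanism here is using agreement among sampled answers as a pseudo-reward. Below is a minimal sketch of how such a majority-vote pseudo-reward could be computed; the function name, exact-match comparison, and binary reward values are illustrative assumptions, not the authors' implementation, which would also need to extract and normalize final answers from full chains of thought.

```python
from collections import Counter

def majority_vote_pseudo_reward(sampled_answers: list[str]) -> list[float]:
    """Assign pseudo-reward 1.0 to samples agreeing with the majority answer.

    Illustrative sketch: a real pipeline would first extract and normalize
    the final answer from each sampled chain of thought before comparing.
    """
    majority_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if ans == majority_answer else 0.0 for ans in sampled_answers]

# Example: three of four samples agree, so the dissenter gets no reward.
print(majority_vote_pseudo_reward(["42", "42", "41", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```

Because this reward never consults ground truth, a policy can eventually maximize agreement without being correct, which is consistent with the reward-hacking collapse the abstract reports.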
Related papers
- Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing [70.35701681177655]
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models. We introduce four efficient strategies to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
arXiv Detail & Related papers (2025-10-30T13:26:58Z) - Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions [17.407689582427437]
Large language model (LLM) reasoning has shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). We introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between the two.
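The summary above describes an interleaved training loop. The following is one plausible reading of that recipe, not the authors' code; rl_step, pass_rate, collect_solutions, sft_step, and the pass-rate threshold are hypothetical stubs standing in for a full training stack.

```python
import random

# Hypothetical stubs (assumptions, not ReLIFT's actual API), defined so the
# control flow below runs end to end.
def rl_step(model, questions):              # one RL update over the question set
    return model

def pass_rate(model, question):             # fraction of sampled solutions that pass
    return random.random()

def collect_solutions(questions):           # gather high-quality reference solutions
    return {q: f"solution to {q}" for q in questions}

def sft_step(model, questions, solutions):  # supervised fine-tuning on hard questions
    return model

def relift_loop(model, questions, num_rounds=10, hard_threshold=0.1):
    """Sketch of the described recipe: train primarily with RL, and when
    questions remain hard, interleave fine-tuning on collected solutions."""
    for _ in range(num_rounds):
        model = rl_step(model, questions)
        hard = [q for q in questions if pass_rate(model, q) < hard_threshold]
        if hard:
            solutions = collect_solutions(hard)
            model = sft_step(model, hard, solutions)
    return model
```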
arXiv Detail & Related papers (2025-06-09T08:11:20Z) - Incentivizing LLMs to Self-Verify Their Answers [20.2584779107763]
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks. We propose a framework that incentivizes LLMs to self-verify their own answers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B.
arXiv Detail & Related papers (2025-06-02T06:54:29Z) - Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this.
arXiv Detail & Related papers (2025-05-19T17:59:31Z) - Self Rewarding Self Improving [0.0]
We demonstrate that large language models can effectively self-improve through self-judging without requiring reference solutions. Our experiments on Countdown puzzles and MIT Integration Bee problems show that models can provide reliable reward signals without ground-truth answers.
arXiv Detail & Related papers (2025-05-12T23:51:04Z) - An Empirical Study on Eliciting and Improving R1-like Reasoning Models [90.52239241349504]
Scaling RL training has become a central technique for implementing such reasoning models. We demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models. We also explore the use of tool manipulation, finding that it significantly boosts the reasoning performance of large reasoning models.
arXiv Detail & Related papers (2025-03-06T15:34:27Z) - Self-rewarding correction for mathematical reasoning [19.480508580498103]
We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs at inference time, without external feedback. We propose a two-stage algorithmic framework for constructing self-rewarding reasoning models using only self-generated data.
arXiv Detail & Related papers (2025-02-26T23:01:16Z) - ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification [53.80183105328448]
Refine via Intrinsic Self-Verification (ReVISE) is an efficient framework that enables LLMs to self-correct their outputs through self-verification. Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.
arXiv Detail & Related papers (2025-02-20T13:50:02Z) - Evolving LLMs' Self-Refinement Capability via Synergistic Training-Inference Optimization [53.93621974137829]
Self-Refinement refers to a model's ability to revise its own responses to produce improved outputs. EVOLVE is a framework for eliciting and tracking the evolution of Self-Refinement through iterative training. We demonstrate the potential of leveraging Self-Refinement to achieve broader Self-Improvement of intrinsic model abilities.
arXiv Detail & Related papers (2025-02-08T15:21:55Z) - Diving into Self-Evolving Training for Multimodal Reasoning [36.70979791148913]
Self-evolving training has emerged as a key approach for complex reasoning tasks. This paper reframes self-evolving training for multimodal reasoning through the lens of reinforcement learning. We propose M-STAR, a framework that achieves consistent performance gains across models of varying sizes and diverse benchmarks.
arXiv Detail & Related papers (2024-12-23T10:18:41Z) - Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models [10.449015816015566]
Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We provide a mathematical formulation for self-improvement, which is largely governed by a quantity we formalize as the generation-verification gap. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance.
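As a rough illustration of how such a gap might be measured empirically (the paper's exact formalization may differ; the data layout and scoring rule here are assumptions):

```python
def generation_verification_gap(samples: dict) -> float:
    """Illustrative estimate: accuracy when the model's own verifier picks
    its highest-scored sample, minus accuracy of a single unaided generation.

    `samples` maps each problem to a list of (verifier_score, is_correct)
    pairs; the first pair is treated as the default generation.
    """
    n = len(samples)
    gen_acc = sum(pairs[0][1] for pairs in samples.values()) / n
    ver_acc = sum(max(pairs, key=lambda p: p[0])[1] for pairs in samples.values()) / n
    return ver_acc - gen_acc

# Toy example: verification selects the correct sample more often than
# blind generation, so the gap is positive and self-improvement has room.
samples = {
    "q1": [(0.2, False), (0.9, True)],
    "q2": [(0.8, True), (0.1, False)],
}
print(generation_verification_gap(samples))  # 0.5
```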
arXiv Detail & Related papers (2024-12-03T18:47:26Z) - Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training. We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z) - Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z) - Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs).
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
arXiv Detail & Related papers (2024-09-19T17:16:21Z) - Augmenting Unsupervised Reinforcement Learning with Self-Reference [63.68018737038331]
Humans possess the ability to draw on past experiences explicitly when learning new tasks.
We propose the Self-Reference (SR) approach, an add-on module explicitly designed to leverage historical information.
Our approach achieves state-of-the-art results in terms of Interquartile Mean (IQM) performance and Optimality Gap reduction on the Unsupervised Reinforcement Learning Benchmark.
arXiv Detail & Related papers (2023-11-16T09:07:34Z)