Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
- URL: http://arxiv.org/abs/2601.20829v1
- Date: Wed, 28 Jan 2026 18:29:21 GMT
- Title: Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
- Authors: Minwu Kim, Safal Shrestha, Keith Ross
- Abstract summary: We propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems. Our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
- Score: 0.3823356975862005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls as problems become saturated. We identify the core challenge as the poor accessibility of informative failures: learning signals exist but are rarely encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model's robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
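A minimal sketch of what failure-prefix conditioning could look like in an RLVR loop is given below. The helper names (`generate`, `verify`, `collect_failure_prefixes`, `conditioned_prompts`) and the parameters (`n_rollouts`, `prefix_frac`) are illustrative assumptions, not the authors' implementation; the point is only that training rollouts start from prefixes of rare incorrect trajectories rather than from the bare question, and that those prefixes can be refreshed as the policy improves.
```python
# Illustrative sketch of failure-prefix conditioning (assumed interfaces, not the paper's code).
import random
from typing import Callable, List, Tuple

def collect_failure_prefixes(
    generate: Callable[[str], Tuple[str, str]],  # problem -> (reasoning trace, final answer)
    verify: Callable[[str, str], bool],          # verifiable reward: is the answer correct?
    problem: str,
    n_rollouts: int = 64,
    prefix_frac: float = 0.5,
) -> List[str]:
    """Sample many rollouts on a saturated problem and keep early prefixes
    of the rare incorrect trajectories (the informative failures)."""
    prefixes = []
    for _ in range(n_rollouts):
        trace, answer = generate(problem)
        if not verify(problem, answer):
            cut = max(1, int(len(trace) * prefix_frac))
            prefixes.append(trace[:cut])
    return prefixes

def conditioned_prompts(problem: str, failure_prefixes: List[str], k: int = 8) -> List[str]:
    """Build training prompts that start from failure-prone states, reallocating
    exploration toward them instead of starting from the original question."""
    chosen = random.sample(failure_prefixes, min(k, len(failure_prefixes)))
    return [problem + "\n" + p for p in chosen]

# Iterative variant: re-run collect_failure_prefixes periodically on the current
# policy, so exploration keeps targeting states the model still gets wrong.
```
The standard RLVR update would then be applied to these conditioned prompts; refreshing the prefixes during training corresponds to the iterative variant the abstract reports as unlocking additional gains after a plateau.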
Related papers
- Learning Robust Reasoning through Guided Adversarial Self-Play [32.87933476043378]
We introduce GASP (Guided Adversarial Self-Play), a robustification method that explicitly trains detect-and-repair capabilities. Without human labels or external teachers, GASP forms an adversarial self-play game within a single model. In-distribution repair guidance, an imitation term on self-generated repairs, increases recovery probability while preserving previously acquired capabilities.
arXiv Detail & Related papers (2026-01-30T02:23:31Z) - InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning [32.274434679047395]
Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). Standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces.
arXiv Detail & Related papers (2026-01-20T18:15:38Z) - Forget Less, Retain More: A Lightweight Regularizer for Rehearsal-Based Continual Learning [51.07663354001582]
Deep neural networks suffer from catastrophic forgetting, where performance on previous tasks degrades after training on a new task. We present a novel approach to address this challenge, focusing on the intersection of memory-based methods and regularization approaches. We formulate a regularization strategy, termed the Information Maximization (IM) regularizer, for memory-based continual learning methods.
arXiv Detail & Related papers (2025-12-01T15:56:00Z) - Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training [76.12556589212666]
We show that curriculum post-training avoids the exponential complexity bottleneck. Under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with sample complexity. We establish guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to order.
arXiv Detail & Related papers (2025-11-10T18:29:54Z) - HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness [49.72591739116668]
Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs). However, prevalent methods like GRPO often fail when task difficulty exceeds the model's capacity, leading to reward sparsity and inefficient training. We propose HINT: Helping Ineffective rollouts Navigate Towards effectiveness, an adaptive hinting framework.
arXiv Detail & Related papers (2025-10-10T13:42:03Z) - Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs).
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
arXiv Detail & Related papers (2024-09-19T17:16:21Z) - Progress or Regress? Self-Improvement Reversal in Post-training [26.051637877066327]
We propose a comprehensive evaluative framework to scrutinize the underlying enhancements of post-training paradigms for self-improvement.
We show that models showing improved performance across benchmarks will paradoxically exhibit declines in broader, essential capabilities.
These findings indicate that current self-improvement practices through post-training are inadequate for equipping models to tackle more complex problems.
arXiv Detail & Related papers (2024-07-06T09:07:11Z) - Fine-tuning Reinforcement Learning Models is Secretly a Forgetting Mitigation Problem [12.185261182744377]
This work conceptualizes one specific cause of poor transfer, accentuated in the reinforcement learning setting.
A model deteriorates on the state subspace of the downstream task not visited in the initial phase of fine-tuning.
We show that standard knowledge retention techniques mitigate the problem and thus allow us to take full advantage of the pre-trained capabilities.
arXiv Detail & Related papers (2024-02-05T10:30:47Z) - A Reusable AI-Enabled Defect Detection System for Railway Using Ensembled CNN [5.381374943525773]
Defect detection is crucial for ensuring the trustworthiness of railway systems.
Current approaches rely on single deep-learning models, like CNNs.
We propose a reusable AI-enabled defect detection approach.
arXiv Detail & Related papers (2023-11-24T19:45:55Z) - Towards Robust Continual Learning with Bayesian Adaptive Moment Regularization [51.34904967046097]
Continual learning seeks to overcome the challenge of catastrophic forgetting, where a model forgets previously learnt information.
We introduce a novel prior-based method that better constrains parameter growth, reducing catastrophic forgetting.
Results show that BAdam achieves state-of-the-art performance for prior-based methods on challenging single-headed class-incremental experiments.
arXiv Detail & Related papers (2023-09-15T17:10:51Z) - NoiER: An Approach for Training more Reliable Fine-Tuned Downstream Task Models [54.184609286094044]
We propose noise entropy regularisation (NoiER) as an efficient learning paradigm that solves the problem without auxiliary models and additional data.
The proposed approach improved traditional OOD detection evaluation metrics by 55% on average compared to the original fine-tuned models.
arXiv Detail & Related papers (2021-08-29T06:58:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.