Beyond What Seems Necessary: Hidden Gains from Scaling Training-Time Reasoning Length under Outcome Supervision
- URL: http://arxiv.org/abs/2602.00927v1
- Date: Sat, 31 Jan 2026 22:54:45 GMT
- Title: Beyond What Seems Necessary: Hidden Gains from Scaling Training-Time Reasoning Length under Outcome Supervision
- Authors: Yihao Xue, Allan Zhang, Jianhao Huang, Amit Sahai, Baharan Mirzasoleiman
- Abstract summary: Training LLMs to think and reason for longer has become a key ingredient in building state-of-the-art models. Recent efforts pursue this in different ways, such as RL fine-tuning to elicit long CoT or scaling latent reasoning through architectural recurrence. Under outcome-only supervision, out-of-distribution (OOD) performance can continue improving as training-time reasoning length increases.
- Score: 30.75583081407994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training LLMs to think and reason for longer has become a key ingredient in building state-of-the-art models that can solve complex problems previously out of reach. Recent efforts pursue this in different ways, such as RL fine-tuning to elicit long CoT or scaling latent reasoning through architectural recurrence. This makes reasoning length an important scaling knob. In this work, we identify a novel phenomenon (both theoretically and experimentally): under outcome-only supervision, out-of-distribution (OOD) performance can continue improving as training-time reasoning length (e.g., the token budget in RL, or the loop count in looped Transformers) increases, even after in-distribution (ID) performance has saturated. This suggests that robustness may require a larger budget than ID validation alone would indicate. We provide theoretical explanations via two mechanisms: (i) self-iteration can induce a stronger inductive bias in the hypothesis class, reshaping ID-optimal solutions in ways that improve OOD generalization; and (ii) when shortcut solutions that work for ID samples but not for OOD samples persist in the hypothesis class, regularization can reduce the learned solution's reliance on these shortcuts as the number of self-iterations increases. We complement the theory with empirical evidence from two realizations of scaling training-time reasoning length: increasing the number of loops in looped Transformers on a synthetic task, and increasing token budgets during RL fine-tuning of LLMs on mathematical reasoning.
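To make the "reasoning length as a training-time scaling knob" idea concrete, here is a minimal PyTorch sketch of a looped Transformer, where the loop count plays the role the abstract describes (the analogue of the token budget in RL fine-tuning). This is an illustrative toy under assumed hyperparameters, not the authors' implementation.

```python
# Minimal sketch of a looped Transformer: a single shared block applied
# repeatedly (architectural recurrence). All hyperparameters and names here
# are hypothetical, chosen only to illustrate the scaling knob.
import torch.nn as nn

class LoopedTransformer(nn.Module):
    def __init__(self, d_model=128, n_heads=4, n_loops=8, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # One block, reused every iteration: parameter count is fixed.
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.n_loops = n_loops  # the training-time reasoning-length knob
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h = self.embed(tokens)
        for _ in range(self.n_loops):  # more loops = longer latent reasoning
            h = self.block(h)
        return self.head(h)

# Scaling the knob: identical parameters, more self-iterations.
model_short = LoopedTransformer(n_loops=4)
model_long = LoopedTransformer(n_loops=16)
```

The paper's finding suggests that, under outcome-only supervision, one may want a larger `n_loops` (or RL token budget) than ID validation alone would justify, since OOD accuracy can keep improving after ID accuracy saturates.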
Related papers
- Precedent-Informed Reasoning: Mitigating Overthinking in Large Reasoning Models via Test-Time Precedent Learning [37.40951956513094]
Reasoning in Large Language Models (LLMs) often suffers from inefficient long chain-of-thought traces with redundant self-exploration and validation. Inspired by human reasoning patterns, where people solve new problems by leveraging past related cases to constrain search spaces and reduce trial-and-error, we propose Precedent-Informed Reasoning (PIR). PIR transforms LRMs' reasoning paradigm from exhaustive self-exploration to guided learning from precedents.
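A hedged sketch of the precedent-guided flow this abstract describes: retrieve related solved cases and condition generation on them instead of exploring from scratch. The similarity measure and prompt format below are illustrative assumptions, not the paper's method.

```python
# Toy precedent retrieval and prompt construction; not the PIR implementation.
from dataclasses import dataclass

@dataclass
class Precedent:
    question: str
    solution: str

def retrieve_precedents(problem: str, store: list, k: int = 2) -> list:
    """Toy retriever: rank stored cases by word overlap with the new problem."""
    words = set(problem.lower().split())
    scored = sorted(store, key=lambda p: -len(words & set(p.question.lower().split())))
    return scored[:k]

def build_pir_prompt(problem: str, store: list) -> str:
    """Prepend retrieved precedents so the model is guided by past cases."""
    parts = [f"Precedent: {p.question}\nSolution: {p.solution}"
             for p in retrieve_precedents(problem, store)]
    parts.append(f"New problem: {problem}\nSolve it, guided by the precedents above.")
    return "\n\n".join(parts)
```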
arXiv Detail & Related papers (2026-02-16T04:17:46Z)
- Constraint-Rectified Training for Efficient Chain-of-Thought [60.52883907721588]
Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs). While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, known as overthinking. Recent research seeks to develop efficient reasoning strategies that balance reasoning length and accuracy.
arXiv Detail & Related papers (2026-02-13T02:13:45Z)
- Does Your Reasoning Model Implicitly Know When to Stop Thinking? [45.954548163594204]
We show that LRMs implicitly know the appropriate time to stop thinking, but this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE, a novel sampling paradigm that unleashes this efficient reasoning potential.
arXiv Detail & Related papers (2026-02-09T07:38:22Z)
- Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models [29.56923793047279]
We introduce Dynamic Outlier Truncation (DOT), a training-time intervention that selectively suppresses redundant tokens. DOT targets only the extreme tail of response lengths within fully correct rollout groups while preserving long-horizon reasoning capabilities. Our method reduces inference token usage by 78% while simultaneously increasing accuracy compared to the initial policy.
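A minimal sketch of the truncation rule suggested by this abstract, assuming a simple quantile cutoff applied only within fully correct rollout groups; the actual DOT criterion and threshold may differ.

```python
# Hedged sketch of the DOT idea: drop only the extreme length tail, and only
# when every rollout in the group is correct. The quantile rule is an assumption.
from typing import List

def dot_keep_mask(lengths: List[int], all_correct: bool, quantile: float = 0.9) -> List[bool]:
    """Return True for rollouts kept for training, False for truncated outliers."""
    if not all_correct:
        # Preserve long-horizon reasoning: only prune when every rollout succeeded.
        return [True] * len(lengths)
    sorted_lens = sorted(lengths)
    cutoff = sorted_lens[int(quantile * (len(sorted_lens) - 1))]
    return [length <= cutoff for length in lengths]

# Example: a fully correct group with one extreme outlier.
print(dot_keep_mask([420, 515, 480, 2900], all_correct=True))
# -> [True, True, True, False]: the 2900-token tail is dropped.
```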
arXiv Detail & Related papers (2026-01-07T14:31:07Z)
- Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training [76.12556589212666]
We show that curriculum post-training avoids the exponential complexity bottleneck. Under outcome-only reward signals, reinforcement learning fine-tuning achieves high accuracy with polynomial sample complexity. We establish guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.
arXiv Detail & Related papers (2025-11-10T18:29:54Z)
- LaSeR: Reinforcement Learning with Last-Token Self-Rewarding [54.72617309922891]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). Previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. We propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss.
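A hedged sketch of the augmented objective this abstract describes: a self-reward read from the last token is regressed toward the verifier's outcome reward via an MSE term added to the usual RLVR loss. Variable names, the `beta` weight, and the way the self-reward is read out are illustrative assumptions.

```python
# Toy illustration of adding a last-token self-rewarding MSE term to an
# RLVR objective; not the LaSeR implementation.
import torch
import torch.nn.functional as F

def laser_loss(rlvr_loss, last_token_self_reward, verifier_reward, beta=0.1):
    """Augment the RLVR loss with an MSE self-rewarding term.

    last_token_self_reward: scalar per sequence, read from the last token.
    verifier_reward: 0/1 outcome reward from the verifier.
    """
    mse = F.mse_loss(last_token_self_reward, verifier_reward)
    return rlvr_loss + beta * mse

# Toy usage with fabricated tensors.
rlvr = torch.tensor(1.3)
self_r = torch.tensor([0.8, 0.2])
verif = torch.tensor([1.0, 0.0])
print(laser_loss(rlvr, self_r, verif))
```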
arXiv Detail & Related papers (2025-10-16T17:55:11Z)
- Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought [64.43689151961054]
We theoretically analyze the training dynamics of a simplified two-layer transformer on the directed graph reachability problem. Our analysis reveals that during training using continuous thought, the index-matching logit will first increase and then remain bounded under mild assumptions.
arXiv Detail & Related papers (2025-09-27T15:23:46Z)
- RL for Reasoning by Adaptively Revealing Rationales [36.50924054394857]
Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. We address this with adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. We show that our adaptive curriculum over partial answers reliably solves problems that are otherwise intractable.
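A hedged sketch of the per-sample curriculum this abstract describes: reveal a prefix of the target rationale as a hint, and shrink the revealed fraction once the model starts succeeding on that sample. The update rule below is an illustrative assumption, not the paper's exact schedule.

```python
# Toy adaptive-backtracking curriculum; not the AdaBack implementation.
def adaback_reveal(target_tokens, reveal_frac):
    """Reveal the first reveal_frac of the target output as a hint prefix."""
    k = int(len(target_tokens) * reveal_frac)
    return target_tokens[:k]

def update_reveal_frac(reveal_frac, solved, step=0.1):
    """Per-sample curriculum: reveal less once the model solves the sample."""
    if solved:
        return max(0.0, reveal_frac - step)  # harder: shorter hint
    return min(1.0, reveal_frac + step)      # easier: longer hint

# Toy usage.
target = ["step1", "step2", "step3", "step4", "step5"]
frac = 0.6
prefix = adaback_reveal(target, frac)         # ['step1', 'step2', 'step3']
frac = update_reveal_frac(frac, solved=True)  # now 0.5
```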
arXiv Detail & Related papers (2025-06-22T17:46:14Z)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [49.61246073215651]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains. However, they also introduce significant computational overhead due to verbose and redundant outputs.
arXiv Detail & Related papers (2025-03-20T17:59:38Z)
- When More is Less: Understanding Chain-of-Thought Length in LLMs [51.631483479081645]
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. This paper challenges the common presumption that longer CoTs are superior, arguing that longer is not always better.
arXiv Detail & Related papers (2025-02-11T05:28:59Z)
- Demystifying Long Chain-of-Thought Reasoning in LLMs [46.352406501403465]
Long chains-of-thought (CoTs) enable strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities. We identify the key factors that enable models to generate long CoT trajectories.
arXiv Detail & Related papers (2025-02-05T17:13:32Z)