h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning
- URL: http://arxiv.org/abs/2510.07312v2
- Date: Wed, 15 Oct 2025 17:57:26 GMT
- Title: h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning
- Authors: Sumeet Ramesh Motwani, Alesia Ivanova, Ziyang Cai, Philip Torr, Riashat Islam, Shital Shah, Christian Schroeder de Witt, Charles London,
- Abstract summary: Large language models excel at short-horizon reasoning tasks, but performance drops as reasoning horizon lengths increase. Existing approaches to combat this rely on inference-time scaffolding or costly step-level supervision. We introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models excel at short-horizon reasoning tasks, but performance drops as reasoning horizon lengths increase. Existing approaches to combat this rely on inference-time scaffolding or costly step-level supervision, neither of which scales easily. In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multi-step dependency chains of arbitrary length. We train models on this data using outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating. Empirically, our method generalizes remarkably well: curriculum training on composed 6th-grade level math problems (GSM8K) boosts accuracy on longer, competition-level benchmarks (GSM-Symbolic, MATH-500, AIME) by up to 2.06x. It also transfers significantly to diverse out-of-distribution ReasoningGym domains and long-context benchmarks, indicating broader generalization. Importantly, our long-horizon improvements are significantly higher than baselines even at high pass@k, showing that models can learn new reasoning paths under RL. Theoretically, we show that curriculum RL with outcome rewards achieves an exponential improvement in sample complexity over full-horizon training, providing training signal comparable to dense supervision. h1 therefore introduces an efficient path towards scaling RL for long-horizon problems using only existing data.
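The abstract gives the pipeline only in prose. As a rough illustration, the sketch below composes short-horizon "atoms" into an h-step dependency chain (each step consumes the previous step's answer) and runs outcome-reward RL under a horizon curriculum. Everything here is an illustrative assumption, not the authors' implementation: the `Atom` format, the chaining rule, the promotion threshold, and the `model.generate` / `model.rl_step` interface are all placeholders.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Atom:
    """A short-horizon problem: a text template plus its answer rule."""
    template: str                  # e.g. "Take {x}, double it, then add 3."
    solve: Callable[[int], int]    # maps the step's input to its answer

def compose_chain(atoms: list[Atom], h: int, seed: int) -> tuple[str, int]:
    """Chain h atoms so each step consumes the previous step's answer;
    only the final answer is supervised (outcome-only reward)."""
    steps, value = [], seed
    for i, atom in enumerate(random.choices(atoms, k=h)):
        steps.append(f"Step {i + 1}: {atom.template.format(x=value)}")
        value = atom.solve(value)
    return "\n".join(steps) + "\nReport only the final number.", value

def extract_final(text: str) -> int | None:
    """Take the last integer in the completion as the model's answer."""
    nums = re.findall(r"-?\d+", text)
    return int(nums[-1]) if nums else None

def curriculum_rl(model, atoms: list[Atom], max_h: int = 16,
                  promote_at: float = 0.8, batch: int = 64) -> None:
    """Outcome-reward RL with an automatic curriculum over horizon h."""
    h = 1
    while h <= max_h:
        solved = 0.0
        for _ in range(batch):
            prompt, answer = compose_chain(atoms, h, random.randint(1, 50))
            completion = model.generate(prompt)                  # placeholder API
            reward = float(extract_final(completion) == answer)  # binary outcome reward
            model.rl_step(prompt, completion, reward)            # placeholder RL update
            solved += reward
        if solved / batch >= promote_at:  # mastered length h: lengthen the chains
            h += 1
```

As a usage check under these assumptions, `compose_chain([Atom("Take {x} and add 2.", lambda x: x + 2)], h=3, seed=5)` yields a three-step prompt whose final answer is 11.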
Related papers
- KLong: Training LLM Agent for Extremely Long-horizon Tasks [58.68395081637727]
This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. We first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench.
arXiv Detail & Related papers (2026-02-19T17:01:08Z)
- JustRL: Scaling a 1.5B LLM with a Simple RL Recipe [45.42398283391072]
Single-stage training achieves state-of-the-art performance on two 1.5B reasoning models. Training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions.
arXiv Detail & Related papers (2025-12-18T15:21:25Z)
- RL for Reasoning by Adaptively Revealing Rationales [36.50924054394857]
Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. We address this by adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. We show that our adaptive curriculum over partial answers reliably solves problems that are otherwise intractable.
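For intuition only, the prefix-revealing idea in this blurb might look roughly like the sketch below; the update rule, step size, and character-level cut are guesses rather than the paper's algorithm.

```python
def make_training_input(prompt: str, target: str, ratio: float) -> str:
    """Reveal the first `ratio` fraction of the target output as context;
    the model must complete the remainder (assumed character-level cut)."""
    cut = int(len(target) * ratio)
    return prompt + target[:cut]

def adjust_ratio(ratio: float, solved: bool, step: float = 0.05) -> float:
    """Per-sample curriculum: reveal less after a success, more after a
    failure, clamped to [0, 1] (guessed update rule)."""
    ratio += -step if solved else step
    return min(1.0, max(0.0, ratio))
```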
arXiv Detail & Related papers (2025-06-22T17:46:14Z)
- Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning [52.32193550674408]
We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). We propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B).
arXiv Detail & Related papers (2025-06-07T02:41:54Z)
- Horizon Reduction Makes RL Scalable [78.67071359991218]
We study the scalability of offline reinforcement learning (RL) algorithms. We use datasets up to 1000x larger than typical offline RL datasets. We show that horizon is the main cause behind the poor scaling of offline RL.
arXiv Detail & Related papers (2025-06-04T17:06:54Z)
- TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression [55.37723860832064]
We propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations. We validate our approach on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B across a diverse set of benchmarks with varying difficulty levels.
arXiv Detail & Related papers (2025-06-03T09:23:41Z)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
- StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs). StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks. Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. We present T1 to scale reinforcement learning by encouraging exploration and to understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
- DHRL: A Graph-Based Approach for Long-Horizon and Sparse Hierarchical Reinforcement Learning [26.973783464706447]
Hierarchical Reinforcement Learning (HRL) has made notable progress in complex control tasks by leveraging temporal abstraction. Previous HRL algorithms often suffer from serious data inefficiency as environments get large. We present a method, Decoupling Horizons Using a Graph in Hierarchical Reinforcement Learning (DHRL), which can alleviate this problem.
arXiv Detail & Related papers (2022-10-11T05:09:34Z)