One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL
- URL: http://arxiv.org/abs/2506.02338v1
- Date: Tue, 03 Jun 2025 00:29:15 GMT
- Title: One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL
- Authors: Hyungjoo Chae, Dongjin Kang, Jihyuk Kim, Beong-woo Kwak, Sunghyun Park, Haeju Park, Jinyoung Yeo, Moontae Lee, Kyungjae Lee
- Abstract summary: We present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1's novel reasoning strategies into short CoT LLMs, enabling them to think longer. Our experiments demonstrate that training on our dataset not only strengthens general reasoning skills, but also provides a strong foundation for reinforcement learning--models initialized on our data achieve 2-3x larger gains with RLVR.
- Score: 23.05557667879586
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the release of R1, a publicly available large reasoning model (LRM), researchers commonly train new LRMs by training language models on R1's long chain-of-thought (CoT) inferences. While prior works show that LRMs' capabilities can be reproduced through direct distillation, the continued reliance on existing models (e.g., R1) remains a critical limitation in advancing the field. As a first step toward independent LRM development, this paper explores the possibility of constructing a long CoT dataset with LLMs that are not trained for inference-time scaling. To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1's novel reasoning strategies into short CoT LLMs, enabling them to think longer and introducing controllability over the thought budget to better manage the overthinking problem. Our extensive analyses validate that our dataset achieves quality comparable to--or slightly below--R1. Furthermore, our experiments demonstrate that training on our dataset not only strengthens general reasoning skills, but also provides a strong foundation for reinforcement learning--models initialized on our data achieve 2-3x larger gains with RLVR.
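The abstract describes the pipeline only at a high level. A minimal sketch of the core idea -- prompting an existing short-CoT LLM to produce an o1-style extended reasoning trace under an explicit thought budget -- is given below. This is an illustration, not the authors' implementation: the prompt wording, the `openai` client usage, the model name, and the `thought_budget` handling are all assumptions.

```python
# Minimal sketch (not the paper's code): elicit a long, budget-controlled
# chain-of-thought from a short-CoT LLM by asking it to apply o1-style
# strategies (decomposition, self-verification, backtracking) within a budget.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REASONING_PROMPT = """You are solving a hard problem. Think step by step inside
<thought> ... </thought> tags before giving a final answer.
While thinking, decompose the problem into sub-goals, verify each intermediate
result, and backtrack when a step looks wrong.
Keep your thought within roughly {budget} words, then write the final answer
after 'Answer:'.

Problem: {problem}
"""

def generate_long_cot(problem: str, budget: int = 800, model: str = "gpt-4o-mini") -> str:
    """Return one long-CoT rationale for `problem` under a word-level thought budget."""
    response = client.chat.completions.create(
        model=model,  # placeholder; the paper uses existing short-CoT LLMs
        messages=[{"role": "user",
                   "content": REASONING_PROMPT.format(budget=budget, problem=problem)}],
        temperature=0.7,
        max_tokens=4 * budget,  # rough token cap so the trace respects the budget
    )
    return response.choices[0].message.content

# Example: collect rationales at several budgets to build SFT data.
# for b in (200, 800, 2000):
#     print(generate_long_cot("If 3x + 7 = 22, what is x?", budget=b))
```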
Related papers
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z)
- Optimizing Length Compression in Large Reasoning Models [15.730667464815548]
Large Reasoning Models (LRMs) often suffer from producing unnecessary and verbose reasoning chains. We propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process.
arXiv Detail & Related papers (2025-06-17T17:50:16Z)
- Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning [10.255235456427037]
We propose a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in Large Language Models (LLMs). The first stage, using more training steps, aims to incentivize the model's reasoning capabilities via Group Relative Policy Optimization. The second stage, using fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization (a minimal sketch of this kind of length-aware reward appears after this list).
arXiv Detail & Related papers (2025-05-27T13:29:51Z) - QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning [80.26953590563232]
We formalize the paradigm of long-context reasoning RL, and identify key challenges in suboptimal training efficiency and an unstable optimization process. We propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Experiments on seven long-context document question-answering benchmarks demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini and Qwen3-235B-A22B.
arXiv Detail & Related papers (2025-05-23T09:31:55Z) - Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning [3.364797975300393]
We present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs). We construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. Experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks.
arXiv Detail & Related papers (2025-05-18T14:08:03Z) - 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models [58.98176123850354]
The recent release of DeepSeek-R1 has generated widespread social impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of language models. However, the implementation details of the released models, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models, have not been fully open-sourced by DeepSeek. Many replication studies have emerged aiming to reproduce the strong performance achieved by DeepSeek-R1, reaching comparable performance through similar training procedures and fully open-source data resources.
arXiv Detail & Related papers (2025-05-01T14:28:35Z) - Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [24.45348222168512]
We propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Our model achieves an average improvement of ~6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1.
arXiv Detail & Related papers (2025-03-09T20:06:45Z) - R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z) - Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models [39.22557129190619]
Distillation--post-training on LRM-generated data--is a straightforward yet effective method to enhance the reasoning abilities of smaller models. To alleviate this bottleneck, we propose constructing tree-based CoT data from scratch via Monte Carlo Tree Search. We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and Joint Post-training Objective, to enhance SFT and RL on the constructed data.
arXiv Detail & Related papers (2025-03-03T12:17:36Z) - BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation [88.77999917897702]
o1 from OpenAI has demonstrated remarkable reasoning capabilities. Many teams have attempted to replicate its LongCoT and reasoning capabilities. This paper introduces a novel approach to enable an LLM's LongCoT capacity without distillation from o1-like models or expensive human annotations.
arXiv Detail & Related papers (2025-02-06T08:19:59Z) - T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. We present T1 to scale reinforcement learning by encouraging exploration and to understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
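Several entries above (LC-R1's Length/Compress rewards, the two-stage Length-aware GRPO framework) revolve around the same mechanism: a verifiable-correctness reward combined with a length term that favors concise but sufficient reasoning. The sketch below illustrates that general idea only; the weights, the budget normalization, and the toy string-match verifier are assumptions and do not reproduce any of the papers' exact formulations.

```python
# Minimal sketch (not from any paper above): a length-aware reward of the kind
# used in RLVR-style training of reasoning models. Correct answers earn a base
# reward; among correct answers, traces that exceed the thought budget are
# penalized, so the model is never pushed to be short at the cost of correctness.

def length_aware_reward(answer: str, reference: str,
                        cot_tokens: int, budget_tokens: int = 1024,
                        length_weight: float = 0.2) -> float:
    """Reward = verifiable correctness minus a capped penalty for exceeding the budget."""
    correct = 1.0 if answer.strip() == reference.strip() else 0.0  # toy verifier
    # Penalize only the portion of the trace beyond the budget, capped at 1.
    overflow = max(0.0, (cot_tokens - budget_tokens) / budget_tokens)
    length_penalty = length_weight * min(1.0, overflow)
    return correct - (length_penalty if correct else 0.0)

# Example: a correct answer with a 2048-token trace under a 1024-token budget
# scores 1.0 - 0.2 = 0.8, while a correct 900-token trace keeps the full 1.0.
print(length_aware_reward("5", "5", cot_tokens=2048))  # 0.8
print(length_aware_reward("5", "5", cot_tokens=900))   # 1.0
print(length_aware_reward("7", "5", cot_tokens=300))   # 0.0
```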
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.