Long Is More Important Than Difficult for Training Reasoning Models
- URL: http://arxiv.org/abs/2503.18069v1
- Date: Sun, 23 Mar 2025 13:33:59 GMT
- Title: Long Is More Important Than Difficult for Training Reasoning Models
- Authors: Si Shen, Fei Huang, Zhixiao Zhao, Chang Liu, Tiansheng Zheng, Danhao Zhu
- Abstract summary: We show that reasoning length, rather than problem difficulty, primarily influences the performance of trained models. We present our model, Long1K-32B, which achieves remarkable performance with only 1,000 training samples.
- Score: 21.369780872368143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Difficult problems, which often result in long reasoning traces, are widely recognized as key factors for enhancing the performance of reasoning models. However, such high-challenge problems are scarce, limiting the size of available datasets. In this paper, we propose a simple method to decouple the reliance on problem difficulty. First, we empirically demonstrate that reasoning length, rather than problem difficulty, primarily influences the performance of trained models. Second, we identify a scaling law on reasoning length, showing that model performance increases in a log-linear fashion as the reasoning data length grows. Finally, we introduce a straightforward technique to generate reasoning data of arbitrary length, and show that the synthesized data is effective for training reasoning models. After fine-tuning the Qwen2.5-32B-Instruct language model on our Long1K dataset, we present our model, Long1K-32B, which achieves remarkable performance with only 1,000 training samples: 95.6% accuracy on MATH and 71.1% on GPQA, outperforming DeepSeek-R1-Distill-Qwen-32B. The model, code, and dataset are all open-sourced and available at https://huggingface.co/ZTss/LONG1.
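The log-linear scaling law mentioned above can be made concrete with a small curve-fitting sketch in Python. The trace lengths and accuracies below are hypothetical placeholders, not numbers from the paper; the fit only illustrates the claimed relationship between reasoning-data length and model performance.

```python
# Illustrative fit of a log-linear scaling law: accuracy grows roughly
# linearly in the log of the average reasoning-trace length.
# All data points here are hypothetical placeholders, NOT results from the paper.
import numpy as np

trace_lengths = np.array([1_000, 2_000, 4_000, 8_000, 16_000])  # tokens (hypothetical)
accuracy = np.array([0.52, 0.58, 0.65, 0.71, 0.78])             # benchmark accuracy (hypothetical)

# Fit accuracy ~= a * log(length) + b
a, b = np.polyfit(np.log(trace_lengths), accuracy, 1)

def predicted_accuracy(length_tokens: float) -> float:
    """Extrapolate accuracy for a given average reasoning length."""
    return a * np.log(length_tokens) + b

print(f"slope a = {a:.3f}, intercept b = {b:.3f}")
print(f"predicted accuracy at 32k tokens: {predicted_accuracy(32_000):.3f}")
```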
Related papers
- SplitReason: Learning To Offload Reasoning [7.016347390223799]
Reasoning in large language models (LLMs) tends to produce substantially longer token generation sequences than simpler language modeling tasks.
We leverage this by offloading only the most challenging parts of the reasoning process to a larger, more capable model.
This approach improves AIME24 reasoning accuracy by 24% and 28.3% while offloading 1.35% and 5% of the generated tokens, respectively.
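A rough sketch of the offloading idea described above, under the assumption that the smaller model is trained to emit a special marker when it reaches a hard reasoning span; the marker name and both generate callables are hypothetical, not the paper's actual interface.

```python
# Hypothetical sketch of reasoning offloading: a small model drafts the trace
# and defers the hardest spans to a larger model. The marker and callables are
# illustrative assumptions, not SplitReason's actual interface.
OFFLOAD_MARK = "<offload>"  # assumed token the small model emits before a hard step

def solve(problem: str, small_generate, large_generate, max_rounds: int = 8) -> str:
    """small_generate / large_generate are placeholder callables that take the
    partial trace and return newly generated text."""
    trace = problem
    for _ in range(max_rounds):
        chunk = small_generate(trace)
        if OFFLOAD_MARK in chunk:
            # Keep the easy prefix, then let the larger model handle the hard span.
            trace += chunk.split(OFFLOAD_MARK, 1)[0]
            trace += large_generate(trace)
        else:
            trace += chunk
            return trace
    return trace
```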
arXiv Detail & Related papers (2025-04-23T03:00:02Z) - Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset.
We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard).
We find that progression from the Easy to the Medium tier mainly requires adopting an R1 reasoning style with minimal SFT (about 1K instances).
Extremely Hard (Exh-level) questions present a fundamentally different challenge; they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z) - R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model [70.77691645678804]
We present the first successful replication of emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by 2%. In addition, we share our failed attempts and insights from trying to achieve R1-like reasoning using RL with instruct models.
arXiv Detail & Related papers (2025-03-07T04:21:47Z) - LLMs Can Easily Learn to Reason from Demonstrations. Structure, not content, is what matters! [53.84130385074551]
Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT). We find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks.
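A minimal sketch of such parameter-efficient LoRA fine-tuning using the Hugging Face peft library; the adapter hyperparameters and target modules below are illustrative assumptions, not the paper's recipe.

```python
# Illustrative LoRA SFT setup (hyperparameters are assumptions, not the paper's).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Low-rank adapters on the attention projections; only these weights are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 32B parameters
# Standard supervised fine-tuning on the long-CoT traces would follow here.
```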
arXiv Detail & Related papers (2025-02-11T08:48:48Z) - BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation [88.77999917897702]
o1 from OpenAI has demonstrated remarkable reasoning capabilities.
Many teams have attempted to replicate its LongCoT and reasoning capabilities.
This paper introduces a novel approach to enable LLMs' LongCoT capacity without distillation from o1-like models or expensive human annotations.
arXiv Detail & Related papers (2025-02-06T08:19:59Z) - O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [98.3430004984531]
We propose Length-Harmonizing Fine-Tuning (O1-Pruner) to minimize reasoning overhead while maintaining accuracy. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner.
arXiv Detail & Related papers (2025-01-22T01:35:11Z) - Towards Scalable and Deep Graph Neural Networks via Noise Masking [59.058558158296265]
Graph Neural Networks (GNNs) have achieved remarkable success in many graph mining tasks.
However, scaling them to large graphs is challenging due to high computational and storage costs.
We present random walk with noise masking (RMask), a plug-and-play module compatible with existing model-simplification works.
arXiv Detail & Related papers (2024-12-19T07:48:14Z) - HiPool: Modeling Long Documents Using Graph Neural Networks [24.91040673099863]
Long sequences in Natural Language Processing (NLP) are a challenging problem.
Recent pretrained language models achieve satisfactory performance on many NLP tasks.
We propose a new challenging benchmark, totaling six datasets with up to 53k samples and an average length of 4,034 tokens.
arXiv Detail & Related papers (2023-05-05T06:58:24Z) - A Simple and Interpretable Predictive Model for Healthcare [0.0]
Deep learning models are currently dominating most state-of-the-art solutions for disease prediction.
These deep learning models, with trainable parameters running into the millions, require huge amounts of compute and data to train and deploy.
We develop a simpler yet interpretable non-deep learning based model for application to EHR data.
arXiv Detail & Related papers (2020-07-27T08:13:37Z) - Learning Interpretable Models Using Uncertainty Oracles [12.879371384378164]
A desirable property of interpretable models is small size, so that they are easily understandable by humans.
This leads to the following challenges: (a) small sizes imply diminished accuracy, and (b) bespoke levers provided by model families to restrict size might be insufficient to reach the desired size-accuracy trade-off.
arXiv Detail & Related papers (2019-06-17T05:53:52Z)