RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?
- URL: http://arxiv.org/abs/2501.11284v1
- Date: Mon, 20 Jan 2025 05:44:01 GMT
- Title: RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?
- Authors: Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, Zhijiang Guo, Yaodong Yang, Muhan Zhang, Debing Zhang
- Abstract summary: We explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar. Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT. RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2% to 81.6%, and on the USA Math Olympiad (AIME) it solves 46.7% of problems using only a 21k-sample mixed code-math dataset.
- Score: 40.575978129688586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can scaling transform reasoning? In this work, we explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar. Through extensive experiments with various LLMs of different sizes, we uncover the ingredients for specialization and scale in Long-CoT training. Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT and the critical role of sample difficulty in the learning process. Our findings demonstrate that Long-CoT reasoning can be effectively triggered with just a few thousand examples, while larger models achieve unparalleled improvements. We also introduce reinforcement learning (RL)-scale training as a promising direction for advancing slow-thinking systems. RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2% to 81.6%, and on the USA Math Olympiad (AIME), it solves 46.7% of problems using only a 21k-sample mixed code-math dataset. In multimodal tasks like GeoQA and MathVista-GEO, RedStar-Geo achieves competitive results with minimal Long-CoT data, outperforming other slow-thinking systems like QvQ-Preview. Compared to QwQ, RedStar strikes the perfect balance between reasoning and generalizability. Our work highlights that, with careful tuning, scaling Long-CoT can unlock extraordinary reasoning capabilities, even with limited data, and sets a new standard for slow-thinking models across diverse challenges. Our data and models are released at https://huggingface.co/RedStar-Reasoning.
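To make the training recipe concrete, below is a minimal sketch of the kind of Long-CoT supervised fine-tuning the abstract describes: an instruction-tuned base model is fine-tuned with a standard causal-LM loss on (question, long chain-of-thought, answer) samples. The model name, file path, `<think>` delimiters, and hyperparameters are illustrative assumptions, not RedStar's exact setup; the RL-scale stage mentioned in the abstract would follow this SFT stage and is not sketched here.

```python
# Minimal Long-CoT SFT sketch (illustrative, not the paper's exact recipe).
# Assumes a JSONL file with "question", "long_cot", and "answer" fields.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; any instruction-tuned base
MAX_LEN = 16384                          # long enough to keep the full CoT trace

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

raw = load_dataset("json", data_files="long_cot_samples.jsonl", split="train")

def to_features(example):
    # Concatenate question, slow-thinking trace, and final answer into one
    # causal-LM training sequence; the CoT supplies most of the target tokens.
    text = (
        f"Question: {example['question']}\n"
        f"<think>\n{example['long_cot']}\n</think>\n"
        f"Answer: {example['answer']}{tokenizer.eos_token}"
    )
    return tokenizer(text, truncation=True, max_length=MAX_LEN)

train_ds = raw.map(to_features, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="redstar-sft-sketch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=2,
        learning_rate=1e-5,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The abstract's sample-efficiency claim corresponds to running this kind of loop with only a few thousand Long-CoT samples for smaller models, scaling toward 1000k samples for the largest ones.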
Related papers
- Efficient Reasoning for LLMs through Speculative Chain-of-Thought [44.76494056102963]
Large reasoning language models such as OpenAI-o1 and Deepseek-R1 have attracted widespread attention due to their impressive task-solving abilities.
Existing methods for efficient reasoning mainly focus on reducing the number of model parameters or shortening the chain-of-thought length.
We introduce Speculative Chain-of-Thought (SCoT), which reduces reasoning latency from another perspective by accelerating the average reasoning speed.
arXiv Detail & Related papers (2025-04-27T03:56:39Z) - Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset.
We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard)
We find that progression from the Easy to the Medium tier requires adopting an R1 reasoning style with minimal SFT (around 1K instances).
Extremely Hard (Exh)-level questions present a fundamentally different challenge; they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z) - Long Is More Important Than Difficult for Training Reasoning Models [21.369780872368143]
We show that reasoning length, rather than problem difficulty, primarily influences the performance of trained models.
We present our model, Long1K-32B, which achieves remarkable performance with only 1,000 training samples.
arXiv Detail & Related papers (2025-03-23T13:33:59Z) - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks.
We explore whether scaling with longer CoTs can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z) - LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! [53.84130385074551]
Large reasoning models (LRMs) tackle complex reasoning problems by following long chains-of-thought (Long CoT)
We find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA)
With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks.
arXiv Detail & Related papers (2025-02-11T08:48:48Z) - When More is Less: Understanding Chain-of-Thought Length in LLMs [53.77747102201451]
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs)
However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy?
In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases.
arXiv Detail & Related papers (2025-02-11T05:28:59Z) - BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation [88.77999917897702]
o1 from OpenAI has demonstrated remarkable reasoning capabilities.
Many teams have attempted to replicate its LongCoT and reasoning capabilities.
This paper introduces a novel approach to enable LLMs' LongCoT capacity without distillation from o1-like models or expensive human annotations.
arXiv Detail & Related papers (2025-02-06T08:19:59Z) - Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z) - DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide and Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs)
We first categorize questions into two subsets based on a confidence score ($\mathcal{CS}$), which is estimated from the statistical frequency of generated answers (see the sketch after this list).
arXiv Detail & Related papers (2024-01-10T14:38:46Z) - In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search [67.35240346713911]
We take the first step towards evaluating large language models (LLMs) in the long-tail distribution of inferential knowledge.
LINK is a systematic long-tail data generation framework for obtaining factually correct yet long-tail inferential statements.
We then use LINK to curate Logic-Induced-Long-Tail (LINT), a large-scale long-tail inferential knowledge dataset.
arXiv Detail & Related papers (2023-11-13T10:56:59Z) - Testing RadiX-Nets: Advances in Viable Sparse Topologies [0.9555447998395205]
Sparsification of hyper-parametrized deep neural networks (DNNs) creates simpler representations of complex data.
RadiX-Nets, a subgroup of DNNs, maintain runtime performance, which counteracts their reduced number of neural connections.
This paper presents a testing suite for RadiX-Nets in scalable models.
arXiv Detail & Related papers (2023-11-06T23:27:28Z) - Detach-ROCKET: Sequential feature selection for time series classification with random convolutional kernels [0.7499722271664144]
We introduce Sequential Feature Detachment (SFD) to identify and prune non-essential features in ROCKET-based models.
SFD can produce models with better test accuracy using only 10% of the original features.
We also present an end-to-end procedure for determining an optimal balance between the number of features and model accuracy.
arXiv Detail & Related papers (2023-09-25T20:24:36Z) - Dynamic Query Selection for Fast Visual Perceiver [42.07082299370995]
We show how to make Perceivers even more efficient by reducing the number of queries Q during inference while limiting the accuracy drop.
arXiv Detail & Related papers (2022-05-22T17:23:51Z) - Long-tailed Recognition by Routing Diverse Distribution-Aware Experts [64.71102030006422]
We propose a new long-tailed classifier called RoutIng Diverse Experts (RIDE)
It reduces model variance with multiple experts, reduces model bias with a distribution-aware diversity loss, and reduces computational cost with a dynamic expert routing module.
RIDE outperforms the state-of-the-art by 5% to 7% on CIFAR100-LT, ImageNet-LT and iNaturalist 2018 benchmarks.
arXiv Detail & Related papers (2020-10-05T06:53:44Z)
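Referenced from the Divide-and-Conquer Reasoning (DCR) entry above: a rough sketch of a confidence score estimated from the statistical frequency of generated answers, used to split questions into easier and harder subsets. The sampling interface, sample count, and threshold below are illustrative assumptions, not the paper's implementation.

```python
# Rough confidence-score split in the spirit of the DCR entry above.
# The CS of a question is the relative frequency of its most common sampled
# answer; generate() and the 0.8 threshold are illustrative assumptions.
from collections import Counter
from typing import Callable, List, Tuple


def confidence_score(answers: List[str]) -> float:
    """Fraction of samples that agree with the most frequent answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


def split_by_confidence(
    questions: List[str],
    generate: Callable[[str, int], List[str]],  # (question, n_samples) -> answers
    n_samples: int = 8,
    threshold: float = 0.8,
) -> Tuple[List[str], List[str]]:
    """Route questions into high-confidence and low-confidence subsets."""
    easy, hard = [], []
    for q in questions:
        cs = confidence_score(generate(q, n_samples))
        (easy if cs >= threshold else hard).append(q)
    return easy, hard
```

In a DCR-style pipeline, the low-confidence subset would presumably be handled by the more elaborate divide-and-conquer reasoning path described in that paper.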