RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?
- URL: http://arxiv.org/abs/2501.11284v1
- Date: Mon, 20 Jan 2025 05:44:01 GMT
- Title: RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?
- Authors: Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, Zhijiang Guo, Yaodong Yang, Muhan Zhang, Debing Zhang
- Abstract summary: We explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar.
Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT.
RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2% to 81.6%, and on the USA Math Olympiad (AIME) it solves 46.7% of problems using only a 21k-sample mixed code-math dataset.
- Score: 40.575978129688586
- License:
- Abstract: Can scaling transform reasoning? In this work, we explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar. Through extensive experiments with various LLMs of different sizes, we uncover the ingredients for specialization and scale for Long-CoT training. Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT and the critical role of sample difficulty in the learning process. Our findings demonstrate that Long-CoT reasoning can be effectively triggered with just a few thousand examples, while larger models achieve unparalleled improvements. We also introduce reinforcement learning (RL)-scale training as a promising direction for advancing slow-thinking systems. RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2% to 81.6%, and on the USA Math Olympiad (AIME), it solves 46.7% of problems using only a 21k-sample mixed code-math dataset. In multimodal tasks like GeoQA and MathVista-GEO, RedStar-Geo achieves competitive results with minimal Long-CoT data, outperforming other slow-thinking systems like QvQ-Preview. Compared to QwQ, RedStar strikes the perfect balance between reasoning and generalizability. Our work highlights that, with careful tuning, scaling Long-CoT can unlock extraordinary reasoning capabilities, even with limited datasets, and sets a new standard for slow-thinking models across diverse challenges. Our data and models are released at https://huggingface.co/RedStar-Reasoning.
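The abstract describes triggering slow-thinking behavior by supervised fine-tuning on long chain-of-thought traces. As a rough illustration of what such training data could look like (not the authors' released pipeline), the sketch below packs Long-CoT samples into chat-style SFT records; the `LongCoTSample` fields, the `<think>` delimiter, and the `build_sft_records` helper are assumptions for illustration only.
```python
# Minimal sketch of packing Long-CoT samples into chat-style SFT records.
# Field names, the <think> delimiter, and the helper below are hypothetical
# illustrations, not RedStar's released data format.
from dataclasses import dataclass
from typing import List, Dict


@dataclass
class LongCoTSample:
    problem: str    # the question or task statement
    reasoning: str  # the long chain-of-thought trace
    answer: str     # the final short answer


def build_sft_records(samples: List[LongCoTSample],
                      max_reasoning_chars: int = 32_000) -> List[Dict]:
    """Convert Long-CoT samples into message lists usable by common SFT trainers."""
    records = []
    for s in samples:
        # Skip traces that exceed a context budget; a crude stand-in for the
        # paper's emphasis on sample difficulty/length in what gets learned.
        if len(s.reasoning) > max_reasoning_chars:
            continue
        records.append({
            "messages": [
                {"role": "user", "content": s.problem},
                {"role": "assistant",
                 "content": f"<think>\n{s.reasoning}\n</think>\n{s.answer}"},
            ]
        })
    return records


if __name__ == "__main__":
    demo = [LongCoTSample(
        problem="Compute 13 * 17.",
        reasoning="13 * 17 = 13 * 10 + 13 * 7 = 130 + 91 = 221.",
        answer="221",
    )]
    print(build_sft_records(demo)[0]["messages"][1]["content"])
```
Records in this generic message format can be fed to most instruction-tuning trainers; the length filter merely hints at the paper's observation that sample difficulty matters, and is not a claim about RedStar's actual data curation.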
Related papers
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? [61.85289698610747]
We study whether o1-like large language models (LLMs) truly possess test-time scaling capabilities.
We find that longer CoTs of these o1-like models do not consistently enhance accuracy.
We propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics.
arXiv Detail & Related papers (2025-02-17T07:21:11Z) - LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! [53.84130385074551]
Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT).
We find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA).
With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks.
arXiv Detail & Related papers (2025-02-11T08:48:48Z) - When More is Less: Understanding Chain-of-Thought Length in LLMs [53.77747102201451]
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs).
However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy?
In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases.
arXiv Detail & Related papers (2025-02-11T05:28:59Z) - BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation [88.77999917897702]
o1 from OpenAI has demonstrated remarkable reasoning capabilities.
Many teams have attempted to replicate its LongCoT and reasoning capabilities.
This paper introduces a novel approach to enable an LLM's LongCoT capacity without distillation from o1-like models or expensive human annotations.
arXiv Detail & Related papers (2025-02-06T08:19:59Z) - Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z) - Testing RadiX-Nets: Advances in Viable Sparse Topologies [0.9555447998395205]
Sparsification of hyper-parametrized deep neural networks (DNNs) creates simpler representations of complex data.
RadiX-Nets, a subgroup of DNNs, maintain runtime which counteracts their lack of neural connections.
This paper presents a testing suite for RadiX-Nets in scalable models.
arXiv Detail & Related papers (2023-11-06T23:27:28Z) - Dynamic Query Selection for Fast Visual Perceiver [42.07082299370995]
In this work, we explore how to make Perceivers even more efficient, by reducing the number of queries Q during inference while limiting the accuracy drop.
arXiv Detail & Related papers (2022-05-22T17:23:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.