Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems
- URL: http://arxiv.org/abs/2412.09413v2
- Date: Sun, 22 Dec 2024 10:44:13 GMT
- Title: Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems
- Authors: Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
- Abstract summary: o1-like reasoning systems have demonstrated remarkable capabilities in solving complex reasoning tasks. We introduce an "imitate, explore, and self-improve" framework to train the reasoning model. Our approach achieves competitive performance compared to industry-level reasoning systems.
- Score: 92.89673285398521
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, slow-thinking reasoning systems, such as o1, have demonstrated remarkable capabilities in solving complex reasoning tasks. These systems typically engage in an extended thinking process before responding to a query, allowing them to generate more thorough, accurate, and well-reasoned solutions. They are primarily developed and maintained by industry, and their core techniques are not publicly disclosed. In response, an increasing number of studies from the research community aim to explore the technical foundations underlying these powerful reasoning systems. Building on these prior efforts, this paper presents a reproduction report on implementing o1-like reasoning systems. We introduce an "imitate, explore, and self-improve" framework, denoted as STILL-2, as our primary technical approach to train the reasoning model. In the initial phase, we use distilled long-form thought data to fine-tune the reasoning model, enabling it to invoke a slow-thinking mode. The model is then encouraged to explore challenging problems by generating multiple rollouts, which can yield a growing number of high-quality trajectories that lead to correct answers. Furthermore, the model undergoes self-improvement by iteratively refining its training dataset. To verify the effectiveness of this approach, we conduct extensive experiments on three challenging benchmarks. The experimental results demonstrate that our approach achieves competitive performance compared to industry-level reasoning systems on these benchmarks.
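The explore and self-improve phases described in the abstract can be pictured as a rollout-and-filter loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `explore`, `self_improve`, and `make_sampler` are hypothetical, fine-tuning is stood in for by a caller-supplied `make_sampler` function, and a rollout counts as correct when its final answer exactly matches a reference answer.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]                    # (question, reasoning trajectory)
Sampler = Callable[[str], Tuple[str, str]]   # question -> (trajectory, final answer)


def explore(
    problems: Dict[str, str],
    sample: Sampler,
    train_set: List[Example],
    n_rollouts: int = 4,
) -> List[Example]:
    """Sample several rollouts per problem and keep the ones that reach
    the reference answer (assumption: exact-match checking)."""
    refined = list(train_set)
    seen = set(refined)
    for question, reference in problems.items():
        for _ in range(n_rollouts):
            trajectory, answer = sample(question)
            example = (question, trajectory)
            # Keep only correct, not-yet-seen trajectories.
            if answer == reference and example not in seen:
                refined.append(example)
                seen.add(example)
    return refined


def self_improve(
    problems: Dict[str, str],
    make_sampler: Callable[[List[Example]], Sampler],
    seed_data: List[Example],
    n_iters: int = 2,
    n_rollouts: int = 4,
) -> List[Example]:
    """Alternate between fitting on the current data and exploring.
    make_sampler is a hypothetical stand-in for fine-tuning a model."""
    train_set = list(seed_data)
    for _ in range(n_iters):
        sample = make_sampler(train_set)  # "imitate": fit on current data
        train_set = explore(problems, sample, train_set, n_rollouts)
    return train_set
```

In this sketch the seed data plays the role of the distilled long-form thought data, and each iteration enlarges the training set with newly found correct trajectories before the next round of fitting.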
Related papers
- LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception [105.78609483419115]
We introduce LongPerceptualThoughts, a new synthetic dataset with 30K long-thought traces for perceptual tasks.
We propose a novel three-stage data synthesis framework that first synthesizes verifiable multiple-choice questions.
We demonstrate notable improvements over existing visual reasoning data-generation methods.
arXiv Detail & Related papers (2025-04-21T18:10:38Z)
- Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability [16.441081996257576]
We propose leveraging reasoning-intensive models to improve less computationally demanding, non-reasoning models.
We demonstrate consistent improvements across various benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.
arXiv Detail & Related papers (2025-04-13T16:26:56Z)
- A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond [88.5807076505261]
Large Reasoning Models (LRMs) have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference.
A growing concern lies in their tendency to produce excessively long reasoning traces.
This inefficiency introduces significant challenges for training, inference, and real-world deployment.
arXiv Detail & Related papers (2025-03-27T15:36:30Z)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [54.04678363287392]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks.
Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains.
arXiv Detail & Related papers (2025-03-20T17:59:38Z)
- Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage.
Models may behave unreliably due to poorly explored failure modes.
Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
- B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners [18.960920426485163]
Self-improvement has emerged as a primary method for enhancing performance.
We identify and propose methods to monitor two pivotal factors in this iterative process.
We introduce B-STaR, a Self-Taught Reasoning framework that adjusts configurations across iterations to balance exploration and exploitation.
arXiv Detail & Related papers (2024-12-23T03:58:34Z)
- REL: Working out is all you need [20.65423513616306]
We observe that OpenAI's O1 model approaches problem-solving in a distinctly human-like manner. These sophisticated reasoning capabilities remain notably absent in other state-of-the-art language models.
arXiv Detail & Related papers (2024-12-05T22:32:01Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [61.98556945939045]
We propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories.
Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework.
arXiv Detail & Related papers (2024-02-01T15:18:33Z)
- Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training [49.3242278912771]
Multimodal reasoning is a challenging task that requires models to reason across multiple modalities to answer questions.
Existing approaches have made progress by incorporating language and visual modalities into a two-stage reasoning framework.
We propose MC-CoT, a self-consistency training strategy that generates multiple rationales and answers, subsequently selecting the most accurate through a voting process.
arXiv Detail & Related papers (2023-11-23T17:09:48Z)
- QAGCN: Answering Multi-Relation Questions via Single-Step Implicit Reasoning over Knowledge Graphs [12.354648004427824]
Multi-relation question answering (QA) is a challenging task.
Recent methods with explicit multi-step reasoning over KGs have been prominently used in this task.
We argue that multi-relation QA can be achieved via end-to-end single-step implicit reasoning.
arXiv Detail & Related papers (2022-06-03T21:01:48Z)