Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns
- URL: http://arxiv.org/abs/2509.21124v2
- Date: Fri, 26 Sep 2025 03:10:45 GMT
- Title: Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns
- Authors: Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Shuo Wang, Hongfei Yan, Jingang Wang, Xunliang Cai
- Abstract summary: We define the foundation model's reasoning potential for the first time as the inverse of the number of independent attempts required to correctly answer the question. We then propose utilizing diverse data enriched with high-value reasoning patterns to expand the reasoning potential. We show that only 10B-token CoTP data enables the 85A6B Mixture-of-Experts (MoE) model to improve by 9.58% on the challenging AIME 2024 and 2025.
- Score: 34.16978953994544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in large reasoning models for challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often use CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities. In this paper, we define, for the first time, a foundation model's reasoning potential as the inverse of the number of independent attempts required to answer a question correctly; this potential is strongly correlated with final model performance. We then propose using diverse data enriched with high-value reasoning patterns to expand this potential. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capability, and use them to construct a core reference set enriched with valuable reasoning patterns. Furthermore, we propose a dual-granularity algorithm, operating over chains of reasoning patterns and token entropy, that efficiently selects high-value CoT data (CoTP) aligned with the core set from the data pool, thereby training models to master reasoning effectively. With only 10B tokens of CoTP data, the 85A6B Mixture-of-Experts (MoE) model improves by 9.58% on the challenging AIME 2024 and 2025 benchmarks and raises the upper bound of downstream RL performance by 7.81%.
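The abstract's definition of reasoning potential (the inverse of the number of independent attempts needed for a first correct answer) can be estimated empirically from k sampled attempts per question. A minimal sketch, assuming the per-attempt success probability is estimated from i.i.d. samples; the function name and correctness labels are illustrative, not from the paper:

```python
import random
from typing import List

def estimate_reasoning_potential(attempts_correct: List[bool]) -> float:
    """Estimate reasoning potential for one question.

    With k independent attempts and empirical success rate p, the expected
    number of attempts until the first correct answer is 1/p (geometric
    distribution), so the potential estimate 1 / E[attempts] is simply p.
    """
    k = len(attempts_correct)
    successes = sum(attempts_correct)
    if successes == 0:
        return 0.0  # never solved within k attempts
    p = successes / k           # empirical per-attempt success probability
    expected_attempts = 1 / p   # mean of a geometric distribution
    return 1 / expected_attempts  # == p

# Toy usage: simulate 8 independent attempts with 40% success probability.
random.seed(0)
attempts = [random.random() < 0.4 for _ in range(8)]
print(estimate_reasoning_potential(attempts))
```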
Related papers
- Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability [129.1296673737603]
Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution space. We propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity.
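A DAC-style reasoning loop can be sketched as below; the `llm` callable is a hypothetical stand-in for a model call, and the one-subproblem-per-line decomposition format is an assumption, not the paper's actual protocol:

```python
from typing import Callable

def dac_solve(problem: str, llm: Callable[[str], str],
              depth: int = 0, max_depth: int = 2) -> str:
    """Divide-and-conquer reasoning sketch: decompose, solve subproblems, merge."""
    if depth >= max_depth:
        return llm(f"Solve directly: {problem}")
    # Ask the model to split the problem; one subproblem per line (assumed format).
    subproblems = [s for s in llm(
        f"Decompose into independent subproblems, one per line: {problem}"
    ).splitlines() if s.strip()]
    if len(subproblems) <= 1:
        return llm(f"Solve directly: {problem}")
    sub_answers = [dac_solve(sp, llm, depth + 1, max_depth) for sp in subproblems]
    joined = "\n".join(f"- {sp}: {ans}" for sp, ans in zip(subproblems, sub_answers))
    return llm(f"Combine these subproblem answers into a final answer for '{problem}':\n{joined}")

# Toy usage with an echo stub in place of a real model.
print(dac_solve("What is 12*7 + 3*5?", llm=lambda prompt: prompt[:60]))
```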
arXiv Detail & Related papers (2026-02-02T18:54:54Z)
- Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts [19.518525241726916]
Encode-Think-Decode (ETD) is a method that enhances the reasoning capabilities of a base model by training it to iterate over a small subset of reasoning-relevant layers during the mid-training stage. ETD models yield substantial gains on 17 reasoning benchmarks, including +28.4% relative accuracy improvement on GSM8K and +36% on MATH with the OLMo-2 1B Base model.
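Iterating over a subset of layers can be sketched in PyTorch as running a middle "think" block multiple times on the latent state; the layer indices and recursion count below are illustrative assumptions, not ETD's actual configuration:

```python
import torch
import torch.nn as nn

class RecursiveLayerStack(nn.Module):
    """Sketch of an Encode-Think-Decode-style stack: encode layers run once,
    a small 'think' block is iterated n_iters times, decode layers run once."""

    def __init__(self, d_model: int = 64, n_layers: int = 8,
                 think_start: int = 3, think_end: int = 5, n_iters: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.think = (think_start, think_end)  # layers to recurse over (assumed)
        self.n_iters = n_iters                 # extra latent "thought" steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        start, end = self.think
        for layer in self.layers[:start]:      # encode
            x = layer(x)
        for _ in range(self.n_iters):          # think: reuse the same weights
            for layer in self.layers[start:end]:
                x = layer(x)
        for layer in self.layers[end:]:        # decode
            x = layer(x)
        return x

model = RecursiveLayerStack()
print(model(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```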
arXiv Detail & Related papers (2025-10-08T15:58:35Z)
- Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning [33.30315111732609]
Chain of Thought (CoT) reasoning has demonstrated remarkable deep reasoning capabilities. However, its reliability is often undermined by the accumulation of errors in intermediate steps. This paper introduces an approach to calibrate the CoT reasoning accuracy by leveraging the model's intrinsic veracity encoding.
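Reading veracity out of internal states is commonly done with a linear probe over hidden activations. A minimal sketch under that assumption; the random features stand in for real per-step hidden states, and the probe setup is not the paper's exact method:

```python
import numpy as np

# Toy stand-ins: hidden states of reasoning steps (n_steps x d_hidden)
# and binary labels for whether each step is factually correct.
rng = np.random.default_rng(0)
d = 32
w_true = rng.normal(size=d)                    # hidden "veracity direction"
X = rng.normal(size=(500, d))
y = (X @ w_true + 0.1 * rng.normal(size=500) > 0).astype(float)

# Logistic-regression probe trained with plain gradient descent.
w = np.zeros(d)
for _ in range(300):
    p = 1 / (1 + np.exp(-(X @ w)))             # predicted step veracity
    w -= 0.1 * X.T @ (p - y) / len(y)          # gradient step

p = 1 / (1 + np.exp(-(X @ w)))
print("probe accuracy:", ((p > 0.5) == y).mean())
```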
arXiv Detail & Related papers (2025-07-14T07:41:35Z)
- The Challenge of Teaching Reasoning to LLMs Without RL or Distillation [31.973226821366325]
Reasoning-capable language models achieve state-of-the-art performance in diverse complex tasks by generating long, explicit Chain-of-Thought traces. We ask whether long CoT can be induced in a base model using only prompting or minimal tuning. The resulting model outperforms the much larger Qwen2.5-Math-72B-Instruct, showing that a handful of high-quality examples can unlock strong reasoning capabilities.
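Inducing long CoT with prompting alone amounts to packing a handful of high-quality worked traces into the context. A minimal sketch; the exemplar format is an assumption, not the paper's template:

```python
from typing import List, Tuple

def build_long_cot_prompt(question: str,
                          exemplars: List[Tuple[str, str, str]]) -> str:
    """Assemble a few-shot prompt from (question, long_cot, answer) triples."""
    parts = []
    for q, cot, ans in exemplars:
        parts.append(f"Question: {q}\nReasoning: {cot}\nAnswer: {ans}\n")
    parts.append(f"Question: {question}\nReasoning:")  # model continues here
    return "\n".join(parts)

exemplars = [("What is 2+2?", "Add 2 and 2 to get 4. Check: 4-2=2.", "4")]
print(build_long_cot_prompt("What is 3*5?", exemplars))
```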
arXiv Detail & Related papers (2025-07-14T01:14:50Z)
- ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation [74.37307916314407]
We propose a framework dubbed ConciseHint, which continuously encourages the reasoning model to speak concisely. Experiments on state-of-the-art LRMs, including the DeepSeek-R1 and Qwen-3 series, demonstrate that our method can effectively produce concise reasoning.
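One way to continuously inject a brevity hint during generation is to append a short hint to the context every fixed number of decoded tokens. A sketch under that assumption; `generate_tokens`, the hint text, and the interval are hypothetical stand-ins, not ConciseHint's actual mechanism:

```python
from typing import Callable

def generate_with_concise_hints(prompt: str,
                                generate_tokens: Callable[[str, int], str],
                                hint: str = "(be concise)",
                                interval: int = 64,
                                max_tokens: int = 256) -> str:
    """Interleave a brevity hint into the context every `interval` tokens."""
    context, produced = prompt, 0
    while produced < max_tokens:
        n = min(interval, max_tokens - produced)
        context += generate_tokens(context, n)   # decode n more tokens
        produced += n
        context += f" {hint} "                   # re-inject the hint
    return context

# Toy usage with a stub decoder that emits filler text.
stub = lambda ctx, n: " step" * (n // 5)
print(generate_with_concise_hints("Solve x+1=2.", stub, max_tokens=128)[:120])
```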
arXiv Detail & Related papers (2025-06-23T16:20:44Z)
- Interleaved Reasoning for Large Language Models via Reinforcement Learning [22.403928213802036]
Long chain-of-thought (CoT) reasoning enhances large language models' (LLMs') reasoning capabilities. We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions.
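Interleaved think/answer training needs a reward that credits correct intermediate answers on multi-hop questions. A minimal reward sketch; the tag format and weights are assumptions, not the paper's exact scheme:

```python
import re
from typing import List

def interleaved_reward(trace: str, gold_intermediate: List[str], gold_final: str,
                       w_mid: float = 0.3, w_final: float = 1.0) -> float:
    """Score a trace that interleaves <think>...</think> and <answer>...</answer>.

    Intermediate answers earn partial credit; the final answer earns full credit.
    """
    answers = re.findall(r"<answer>(.*?)</answer>", trace, flags=re.S)
    if not answers:
        return 0.0
    mid, final = answers[:-1], answers[-1].strip()
    reward = w_final * (final == gold_final)
    for pred, gold in zip(mid, gold_intermediate):
        reward += w_mid * (pred.strip() == gold)
    return reward

trace = "<think>hop 1</think><answer>Paris</answer><think>hop 2</think><answer>France</answer>"
print(interleaved_reward(trace, ["Paris"], "France"))  # 1.3
```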
arXiv Detail & Related papers (2025-05-26T07:58:17Z)
- Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning [54.65050470296886]
We propose the CoT Thought Leap Bridge Task, which aims to automatically detect leaps and generate missing intermediate reasoning steps. We demonstrate that models fine-tuned on bridged datasets consistently outperform those trained on original datasets. Our approach effectively enhances distilled data and provides better starting points for reinforcement learning.
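Bridging a thought leap means detecting adjacent steps whose gap is too large and asking a generator for the missing step. A sketch with a hypothetical gap scorer and bridge generator; both callables and the threshold are assumptions:

```python
from typing import Callable, List

def bridge_thought_leaps(steps: List[str],
                         gap_score: Callable[[str, str], float],
                         generate_bridge: Callable[[str, str], str],
                         threshold: float = 0.5) -> List[str]:
    """Insert generated intermediate steps wherever consecutive steps 'leap'."""
    bridged = [steps[0]]
    for nxt in steps[1:]:
        if gap_score(bridged[-1], nxt) > threshold:   # leap detected
            bridged.append(generate_bridge(bridged[-1], nxt))
        bridged.append(nxt)
    return bridged

# Toy usage: flag a leap when two steps share no words.
score = lambda a, b: 0.0 if set(a.split()) & set(b.split()) else 1.0
gen = lambda a, b: f"[bridge between '{a}' and '{b}']"
print(bridge_thought_leaps(["expand (x+1)^2", "so x = 3"], score, gen))
```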
arXiv Detail & Related papers (2025-05-20T17:59:31Z)
- S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models [2.9925837108958864]
Test-time scaling has emerged as an active research focus in the large language model community. Recent studies reveal that reasoning models (even Qwen3) consistently exhibit excessive thought redundancy. This paper introduces Serial-Group Decaying-Reward Policy Optimization (S-GRPO), a novel reinforcement learning paradigm.
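The serial-group, decaying-reward idea can be sketched as rewarding a correct early exit more than a correct late one across serial truncation points of a rollout. The geometric decay below is an assumed form, not necessarily S-GRPO's exact schedule:

```python
from typing import List

def decaying_exit_rewards(exit_correct: List[bool], decay: float = 0.5) -> List[float]:
    """Assign rewards to serial early-exit positions of one rollout.

    Position i exits the chain earlier than position i+1; a correct early
    exit earns more reward, so correct exits get decay**i, incorrect get 0.
    """
    return [(decay ** i) if ok else 0.0 for i, ok in enumerate(exit_correct)]

# Three truncation points: the first two exits are wrong, the last is right.
print(decaying_exit_rewards([False, False, True]))  # [0.0, 0.0, 0.25]
```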
arXiv Detail & Related papers (2025-05-12T15:50:44Z)
- Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models [23.34070841541423]
We propose Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT). Our experiments demonstrate that models trained with LS-Mixture SFT, compared to those trained with direct SFT, achieve an average accuracy improvement of 2.3%. This work offers an approach to endow non-reasoning models with reasoning capabilities through supervised fine-tuning.
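Mixing long CoT traces with shortened counterparts in the fine-tuning set can be sketched as below; the 50/50 ratio and the `shorten` rewriter are assumptions, not the paper's actual mixture recipe:

```python
import random
from typing import Callable, List, Tuple

def build_ls_mixture(long_data: List[Tuple[str, str]],
                     shorten: Callable[[str], str],
                     short_ratio: float = 0.5,
                     seed: int = 0) -> List[Tuple[str, str]]:
    """Mix long-CoT samples with shortened rewrites of a random subset."""
    rng = random.Random(seed)
    mixed = list(long_data)
    n_short = int(short_ratio * len(long_data))
    for q, cot in rng.sample(long_data, n_short):
        mixed.append((q, shorten(cot)))   # add a concise counterpart
    rng.shuffle(mixed)
    return mixed

data = [("2+2?", "First, 2+2 means adding... the answer is 4.")]
print(build_ls_mixture(data, shorten=lambda c: c.split("...")[-1].strip(),
                       short_ratio=1.0))
```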
arXiv Detail & Related papers (2025-05-06T12:18:11Z)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [49.61246073215651]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains. However, they also introduce significant computational overhead due to verbose and redundant outputs.
arXiv Detail & Related papers (2025-03-20T17:59:38Z)
- Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness.
We show that VLMs trained on short answers do not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z)
- ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs).
Existing CoT approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts.
We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
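Automatic CoT prompt generation can be sketched as a generate-then-filter loop: sample candidate CoT prompts from a model and keep the ones that pass a quality check. The `llm` and `is_consistent` callables below are hypothetical stand-ins, not CoTGenius's actual pipeline:

```python
from typing import Callable, List

def generate_cot_prompts(question: str, llm: Callable[[str], str],
                         is_consistent: Callable[[str], bool],
                         n_candidates: int = 8) -> List[str]:
    """Generate candidate CoT prompts and keep only those passing the filter."""
    kept = []
    for i in range(n_candidates):
        candidate = llm(f"[sample {i}] Write a step-by-step solution prompt for: {question}")
        if is_consistent(candidate):   # quality filter, e.g. answer agreement
            kept.append(candidate)
    return kept

# Toy usage with stubs.
prompts = generate_cot_prompts("Sum of first 10 odd numbers?",
                               llm=lambda p: p,
                               is_consistent=lambda c: "step" in c)
print(len(prompts))  # 8
```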
arXiv Detail & Related papers (2024-03-21T11:34:26Z)