Training Small Reasoning LLMs with Cognitive Preference Alignment
- URL: http://arxiv.org/abs/2504.09802v1
- Date: Mon, 14 Apr 2025 02:03:54 GMT
- Title: Training Small Reasoning LLMs with Cognitive Preference Alignment
- Authors: Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang,
- Abstract summary: We introduce a novel framework called Critique-Rethink-Verify (CRV), designed for training smaller yet powerful reasoning LLMs.<n>CRV consists of multiple LLM agents, each specializing in unique abilities.<n>We propose the cognitive preference optimization (CogPO) algorithm to enhance the reasoning abilities of smaller models.
- Score: 11.367717208838101
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reasoning capabilities of large language models (LLMs), such as OpenAI's o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need to explore strategies to train effective reasoning LLMs with far fewer parameters. A critical challenge is that smaller models have different capacities and cognitive trajectories than their larger counterparts. Hence, direct distillation of chain-of-thought (CoT) results from large LLMs to smaller ones can be sometimes ineffective and requires a huge amount of annotated data. In this paper, we introduce a novel framework called Critique-Rethink-Verify (CRV), designed for training smaller yet powerful reasoning LLMs. Our CRV framework consists of multiple LLM agents, each specializing in unique abilities: (i) critiquing the CoTs according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. We further propose the cognitive preference optimization (CogPO) algorithm to enhance the reasoning abilities of smaller models by aligning thoughts of these models with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of CRV and CogPO, which outperforms other training methods by a large margin.
Related papers
- Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models [53.4530106173067]
Large language models (LLMs) with reinforcement learning (RL) have shown promising improvements in complex reasoning tasks.<n>RL remains challenging for tiny LLMs with 1 billion parameters or fewer because they lack the necessary pretraining strength to explore effectively.<n>This work introduces a novel intrinsic motivation approach that leverages episodic memory to address this challenge.
arXiv Detail & Related papers (2025-04-03T04:46:17Z) - Innate Reasoning is Not Enough: In-Context Learning Enhances Reasoning Large Language Models with Less Overthinking [39.48406368755411]
Large Language Models (LLMs) have introduced Reasoning Large Language Models (RLLMs)<n>RLLMs exhibit innate Chain-of-Thought (CoT) reasoning capability obtained from training, leading to a natural question: "Is CoT prompting necessary to enhance the reasoning capability of RLLMs?"<n>We present the first comprehensive analysis of the impacts of Zero-shot CoT and Few-shot CoT on RLLMs across mathematical reasoning tasks.
arXiv Detail & Related papers (2025-03-25T12:37:22Z) - Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities [101.77467538102924]
Recent advancements in Large Reasoning Models (LRMs) have demonstrated remarkable performance in specialized reasoning tasks.
We show that acquiring deliberative reasoning capabilities significantly reduces the foundational capabilities of LRMs.
We demonstrate that adaptive reasoning -- employing modes like Zero-Thinking, Less-Thinking, and Summary-Thinking -- can effectively alleviate these drawbacks.
arXiv Detail & Related papers (2025-03-23T08:18:51Z) - R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
textbfR1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models.<n>Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start.<n>Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z) - Demystifying Long Chain-of-Thought Reasoning in LLMs [46.352406501403465]
Long chains-of-thought (CoTs) enable strategies like backtracking and error correction.<n>Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities.<n>We identify the key factors that enable models to generate long CoT trajectories.
arXiv Detail & Related papers (2025-02-05T17:13:32Z) - Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains.<n>Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities.<n>We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z) - Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs [22.177654792824896]
We focus on small-sized language models (3B to 7B parameters) for their cost-efficiency and accessibility.<n>We explore various training configurations and strategies across four open-source pre-trained models.<n>Key insights from our work include: (i) larger batch sizes paired with lower learning rates lead to improved model performance on benchmarks such as MMLU, MTBench, and Open LLM Leaderboard; (ii) early-stage training dynamics, such as lower gradient norms and higher loss values, are strong indicators of better final model performance; (iv) we observed no significant difference in performance between phased and stacked training strategies, but
arXiv Detail & Related papers (2024-12-17T21:16:59Z) - Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic [48.94340387130627]
Critic-CoT is a framework that pushes LLMs toward System-2-like critic capability.
CoT reasoning paradigm and the automatic construction of distant-supervision data without human annotation.
Experiments on GSM8K and MATH demonstrate that our enhanced model significantly boosts task-solving performance.
arXiv Detail & Related papers (2024-08-29T08:02:09Z) - MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs)
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z) - Beyond Task Performance: Evaluating and Reducing the Flaws of Large
Multimodal Models with In-Context Learning [105.77733287326308]
We evaluate 10 recent open-source LMMs from 3B up to 80B parameter scale, on 5 different axes; hallucinations, abstention, compositionality, explainability and instruction following.
We explore the training-free in-context learning (ICL) as a solution, and study how it affects these limitations.
Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants such as; Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL.
arXiv Detail & Related papers (2023-10-01T12:02:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.