Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
- URL: http://arxiv.org/abs/2505.14116v1
- Date: Tue, 20 May 2025 09:21:26 GMT
- Title: Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
- Authors: Hongru Wang, Deng Cai, Wanjun Zhong, Shijue Huang, Jeff Z. Pan, Zeming Liu, Kam-Fai Wong
- Abstract summary: We introduce Self-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and improve performance through self-training. Our proposed SRLM achieves an average absolute improvement of more than $+2.5$ points across five reasoning tasks.
- Score: 42.40884882220895
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Inference-time scaling has attracted much attention because it significantly enhances the performance of Large Language Models (LLMs) on complex reasoning tasks by increasing the length of the Chain-of-Thought (CoT). These longer intermediate reasoning rationales embody various meta-reasoning skills found in human cognition, such as reflection and decomposition, and are difficult to create and acquire. In this work, we introduce Self-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and iteratively improve performance through self-training. By incorporating a few demonstration examples (i.e., 1,000 samples) of how to unfold hidden reasoning chains from existing responses, which act as a reasoning catalyst, we demonstrate that SRLM not only enhances the model's initial performance but also ensures more stable and consistent improvements in subsequent iterations. Our proposed SRLM achieves an average absolute improvement of more than $+2.5$ points across five reasoning tasks (MMLU, GSM8K, ARC-C, HellaSwag, and BBH) on two backbone models. Moreover, it yields larger improvements as more samples are drawn during inference, such as an absolute $+7.89$ average improvement with $64$ samples, revealing the in-depth, diverse and creative reasoning paths in SRLM against the strong baseline.
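The abstract describes an iterative loop: a small set of catalyst demonstrations teaches the model to unfold hidden reasoning chains, the model then rewrites existing responses into longer rationales, and it is trained again on its own output. The following is only a toy sketch of that control flow, not the authors' implementation. `ToyModel`, `Example`, `unfold`, `fine_tune`, `self_train`, the length-based filter, and the choice to always expand the seed responses are all hypothetical stand-ins introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Example:
    question: str
    response: str  # answer text, possibly preceded by a rationale


@dataclass(frozen=True)
class ToyModel:
    """Hypothetical stand-in for an LLM; `skill` counts training rounds."""
    skill: int = 0

    def unfold(self, ex: Example) -> str:
        # Pretend to make hidden steps explicit: prepend `skill` placeholder steps.
        steps = [f"step {i + 1}" for i in range(self.skill)]
        return " -> ".join(steps + [ex.response])

    def fine_tune(self, data: List[Example]) -> "ToyModel":
        # Pretend training: one more round of skill if any data was provided.
        return ToyModel(self.skill + 1) if data else self


def self_train(
    model: ToyModel,
    catalyst: List[Example],
    seed: List[Example],
    iterations: int,
) -> Tuple[ToyModel, List[Example]]:
    # Step 0: learn how to unfold reasoning from the catalyst demonstrations.
    model = model.fine_tune(catalyst)
    data: List[Example] = []
    for _ in range(iterations):
        # Assumption: each round re-expands the original seed responses.
        expanded = [Example(ex.question, model.unfold(ex)) for ex in seed]
        # Assumed filter: keep an expansion only if it is longer than the seed.
        data = [
            new for old, new in zip(seed, expanded)
            if len(new.response) > len(old.response)
        ]
        model = model.fine_tune(data)
    return model, data


catalyst = [Example("demo question", "demo unfolded answer")]
seed = [Example("2+2?", "4")]
model, data = self_train(ToyModel(), catalyst, seed, iterations=2)
print(model.skill, data[0].response)
```

In a real setting the filter would more plausibly check answer correctness or quality rather than length; length is used here only to keep the sketch self-contained.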
Related papers
- Learning to Reason in LLMs by Expectation Maximization [55.721496945401846]
We formalize reasoning as a latent variable model and derive an expectation-maximization objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution that generates rationales that justify correct answers.
arXiv Detail & Related papers (2025-12-23T08:56:49Z)
- MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization [66.82303841930752]
Diffusion language models (DLMs) have presented a promising alternative to traditional autoregressive large language models (LLMs). DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases. We propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to consider the token correlation during the denoising process.
arXiv Detail & Related papers (2025-10-24T13:57:59Z)
- Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning [10.255235456427037]
We propose a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in Large Language Models (LLMs). The first stage, using more training steps, aims to incentivize the model's reasoning capabilities via Group Relative Policy Optimization. The second stage, using fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization.
arXiv Detail & Related papers (2025-05-27T13:29:51Z)
- Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning [3.364797975300393]
We present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs). We construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training. Experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks.
arXiv Detail & Related papers (2025-05-18T14:08:03Z)
- RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z)
- Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods [39.89239733570008]
This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models. We find that non-reasoning models, even with an extremely high inference budget, still fall substantially behind reasoning models. For reasoning models, majority voting proves to be a robust inference strategy, generally competitive with or outperforming other more sophisticated ITC methods.
arXiv Detail & Related papers (2025-04-18T19:32:55Z)
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models [33.547353090281284]
We propose a novel reward model approach called the Hierarchical Reward Model. It evaluates both individual and consecutive reasoning steps at both fine-grained and coarse-grained levels. It excels at assessing multi-step reasoning coherence, especially when flawed steps are later corrected through self-reflection.
arXiv Detail & Related papers (2025-03-16T15:18:40Z)
- Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training [66.48331530995786]
We propose syMmetry-ENhanceD (MEND) Data Augmentation, a data-centric approach that improves the model's ability to extract useful information from context. Unlike existing methods that emphasize reasoning chain augmentation, our approach improves model robustness at the knowledge extraction stage. Experiments on both logical and arithmetic reasoning tasks show that MEND enhances reasoning performance across diverse query variations.
arXiv Detail & Related papers (2025-02-25T03:03:35Z)
- SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation [14.786100203787194]
Large language models demonstrate exceptional performance in simple code generation tasks but face challenges in tackling complex problems. We propose a reasoning-augmented data generation process, SRA-MCTS, which guides the model to autonomously generate high-quality intermediate reasoning paths. Our method operates entirely through the model itself without requiring additional supervision.
arXiv Detail & Related papers (2024-11-17T12:31:04Z)
- Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z)
- Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs [63.36637269634553]
We introduce a novel approach where LLMs are fine-tuned to generate a sequence of Diverse Chains of Thought (DCoT) within a single inference step. We show that fine-tuning on DCoT improves performance over the CoT baseline across model families and scales. Our work is also significant because both quantitative analyses and manual evaluations reveal the observed gains stem from the models' ability to refine an initial reasoning chain.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models [84.15513004135576]
Current research enhances the reasoning performance of Large Language Models (LLMs) by sampling multiple reasoning chains and ensembling based on the answer frequency.
This approach fails in scenarios where the correct answers are in the minority.
We introduce a hierarchical reasoning aggregation framework AoR, which selects answers based on the evaluation of reasoning chains.
arXiv Detail & Related papers (2024-05-21T17:12:19Z)
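Several entries above refer to sampling multiple reasoning chains and selecting the most frequent answer (majority voting), and the last entry notes that this fails when the correct answer is in the minority. A minimal sketch of that behaviour follows; the sampled answers are made-up strings and `majority_vote` is a hypothetical helper, not code from any of the listed papers.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among sampled reasoning chains."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers from five sampled chains; suppose "12" is correct.
mostly_right = ["12", "15", "12", "12", "15"]
mostly_wrong = ["15", "15", "15", "12", "12"]

# Frequency-based selection recovers "12" only when it is the majority answer.
print(majority_vote(mostly_right))
print(majority_vote(mostly_wrong))
```

The second call selects "15" even though two chains reached "12", which is the failure case that selection methods based on evaluating the reasoning chains themselves aim to address.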
This list is automatically generated from the titles and abstracts of the papers in this site.