Related papers: SLIM: Subtrajectory-Level Elimination for More Effective Reasoning

SLIM: Subtrajectory-Level Elimination for More Effective Reasoning

URL: http://arxiv.org/abs/2508.19502v1
Date: Wed, 27 Aug 2025 01:19:44 GMT
Title: SLIM: Subtrajectory-Level Elimination for More Effective Reasoning
Authors: Xifeng Yao, Chengyuan Ma, Dongyu Lang, Yinhao Ni, Zhiwei Xu, Huarui Xie, Zihao Chen, Guang Shen, Dandan Tu, Yi Bai, Changzheng Zhang,
Abstract summary: Fine-tuning models with complex reasoning trajectories may not always be optimal.<n>We develop a "5+2" framework to identify suboptimal subtrajectories within the reasoning trajectory.<n>Our method can reduce the number of suboptimal subtrajectories by 25.9% during the inference.
Score: 11.751542939912186
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: In recent months, substantial progress has been made in complex reasoning of Large Language Models, particularly through the application of test-time scaling. Notable examples include o1/o3/o4 series and DeepSeek-R1. When responding to a query, these models generate an extended reasoning trajectory, during which the model explores, reflects, backtracks, and self-verifies before arriving at a conclusion. However, fine-tuning models with such reasoning trajectories may not always be optimal. Our findings indicate that not all components within these reasoning trajectories contribute positively to the reasoning process; in fact, some components may affect the overall performance negatively. In this study, we divide a reasoning trajectory into individual subtrajectories and develop a "5+2" framework to: (1) systematically identify suboptimal subtrajectories within the reasoning trajectory based on five human-established criteria; (2) assess the independence of the suboptimal subtrajectories identified in (1) from the subsequent content, ensuring that their elimination does not compromise overall flow and coherence of the reasoning process. Additionally, a sampling algorithm, built upon the "5+2" framework, is employed to select data whose reasoning process is free from suboptimal subtrajectories to the highest degree. Experimental results demonstrate that our method can reduce the number of suboptimal subtrajectories by 25.9\% during the inference. Furthermore, our method achieves an average accuracy of 58.92\% on highly challenging math benchmarks with only two thirds of training data, surpassing the average accuracy of 58.06\% achieved with the entire data, and outperforming open-source datasets, when fine-tuning Qwen2.5-Math-7B. Finally, We validated our method under resource constraints and observed improved performance across various inference token limits.

Related papers

Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning [11.179446105672461]
We propose a multi-stage efficient reasoning method that combines supervised fine-tuning and reinforcement learning.<n>Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models.<n>It achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods.
arXiv Detail & Related papers (2026-01-06T12:31:51Z)
Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search [62.1546099504045]
We propose a dual-phase test-time scaling framework that separates reasoning into planning and execution.<n>Specifically, we decompose reasoning trajectories and develop reward models for each phase, enabling the search to explore and prune plans and executions separately.<n> Experiments on both mathematical reasoning and code generation benchmarks demonstrate that our approach consistently improves accuracy while reducing computation redundant.
arXiv Detail & Related papers (2025-09-29T19:27:23Z)
Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning [10.186434946738201]
Large language models (LLMs) demonstrate remarkable reasoning capabilities in tasks such as algorithmic coding and mathematical problem-solving.<n>Recent methods have improved reasoning through expanded corpus and multistage training combining reinforcement learning and supervised fine-tuning.
arXiv Detail & Related papers (2025-08-13T15:32:25Z)
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination [67.67725938962798]
Pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks.<n>We introduce a generator that creates fully clean arithmetic problems of arbitrary length and difficulty, dubbed RandomCalculation.<n>We show that only accurate reward signals yield steady improvements that surpass the base model's performance boundary.
arXiv Detail & Related papers (2025-07-14T17:55:15Z)
Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better.<n>TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks.<n>We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z)
Dissecting Long-Chain-of-Thought Reasoning Models: An Empirical Study [91.78803511141975]
This work focuses on the roles of positive and negative samples in scaling reinforcement learning.<n>We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage.<n>We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes.
arXiv Detail & Related papers (2025-06-05T11:47:10Z)
Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning [21.70706473875226]
We propose Reinforcement Distillation (REDI), a two-stage framework.<n> Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT)<n>Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective.<n>Our empirical evaluations demonstrate REDI's superiority over baseline Rejection Sampling SFT or SFT combined with DPO/SimPO on mathematical reasoning tasks.
arXiv Detail & Related papers (2025-05-30T17:47:17Z)
Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models [68.96619605651155]
Large reasoning models (LRMs) may drastically increase the output length due to overthinking.<n>We propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns.<n>Our method achieves up to a 12% accuracy improvement and reducing token usage from approximately 5,000 to 3,000 tokens.
arXiv Detail & Related papers (2025-05-27T20:59:29Z)
Scaling Reasoning can Improve Factuality in Large Language Models [7.184302333801519]
We thoroughly examine large language model (LLM) reasoning within complex open-domain question-answering (QA) scenarios.<n>To enrich reasoning traces, we introduce factual information from knowledge graphs in the form of paths into our reasoning traces.<n>Our findings indicate that, within a single run, smaller reasoning models achieve noticeable improvements in factual accuracy compared to their original instruction-tuned counterparts.
arXiv Detail & Related papers (2025-05-16T11:39:33Z)
When Neural Code Completion Models Size up the Situation: Attaining Cheaper and Faster Completion through Dynamic Model Inference [11.704110756342212]
We propose a novel dynamic inference method specifically tailored for code completion models. It can averagely skip 1.7 layers out of 16 layers in the models, leading to an 11.2% speedup with only a marginal 1.1% reduction in ROUGE-L.
arXiv Detail & Related papers (2024-01-18T13:26:53Z)
Learning Unnormalized Statistical Models via Compositional Optimization [73.30514599338407]
Noise-contrastive estimation(NCE) has been proposed by formulating the objective as the logistic loss of the real data and the artificial noise. In this paper, we study it a direct approach for optimizing the negative log-likelihood of unnormalized models.
arXiv Detail & Related papers (2023-06-13T01:18:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.