SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression
- URL: http://arxiv.org/abs/2508.12604v1
- Date: Mon, 18 Aug 2025 04:02:15 GMT
- Title: SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression
- Authors: Yuyang Xu, Yi Cheng, Haochao Ying, Zhuoyun Du, Renjun Hu, Xing Shi, Wei Lin, Jian Wu,
- Abstract summary: Post-training methods incur substantial computational overhead due to auxiliary models and overthinking.<n>We propose Self-traced Step-wise Preference Optimization (SSPO), a plug RLgable process supervision framework.<n>SSPO uses step-wise preference signals generated by the model itself to guide the optimization process for reasoning compression.
- Score: 15.87106741558898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning) often incur substantial computational overhead due to auxiliary models and overthinking. In this paper, we empirically reveal that the incorrect answers partially stem from verbose reasoning processes lacking correct self-fix, where errors accumulate across multiple reasoning steps. To this end, we propose Self-traced Step-wise Preference Optimization (SSPO), a pluggable RL process supervision framework that enables fine-grained optimization of each reasoning step. Specifically, SSPO requires neither auxiliary models nor stepwise manual annotations. Instead, it leverages step-wise preference signals generated by the model itself to guide the optimization process for reasoning compression. Experiments demonstrate that the generated reasoning sequences from SSPO are both accurate and succinct, effectively mitigating overthinking behaviors without compromising model performance across diverse domains and languages.
Related papers
- Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search [62.1546099504045]
We propose a dual-phase test-time scaling framework that separates reasoning into planning and execution.<n>Specifically, we decompose reasoning trajectories and develop reward models for each phase, enabling the search to explore and prune plans and executions separately.<n> Experiments on both mathematical reasoning and code generation benchmarks demonstrate that our approach consistently improves accuracy while reducing computation redundant.
arXiv Detail & Related papers (2025-09-29T19:27:23Z) - Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute [57.16286134405821]
We propose Fractional Reasoning, a framework that enables continuous control over reasoning intensity at inference time.<n>Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor.<n> Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
arXiv Detail & Related papers (2025-06-18T21:15:59Z) - Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models [68.96619605651155]
Large reasoning models (LRMs) may drastically increase the output length due to overthinking.<n>We propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns.<n>Our method achieves up to a 12% accuracy improvement and reducing token usage from approximately 5,000 to 3,000 tokens.
arXiv Detail & Related papers (2025-05-27T20:59:29Z) - LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling [29.721108461390973]
We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step.<n>PIR identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components.<n>Our approach demonstrates strong generalizability across different model sizes, data sources, and token budgets.
arXiv Detail & Related papers (2025-05-25T15:17:57Z) - Let LLMs Break Free from Overthinking via Self-Braking Tuning [60.08396797526657]
Large reasoning models (LRMs) have significantly enhanced their reasoning capabilities by generating longer chains of thought.<n>This performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process.<n>We propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process.
arXiv Detail & Related papers (2025-05-20T16:53:40Z) - The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency.<n>UPFT removes the need for labeled data or exhaustive sampling.<n> Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z) - Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models [56.37421741507468]
Chain-of-Thought (CoT) reasoning has significantly enhanced the performance of large language models (LLMs)<n>We propose a method to identify critical reasoning steps using perplexity as a measure of their importance.
arXiv Detail & Related papers (2025-02-18T20:04:51Z) - PRefLexOR: Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking [0.0]
PRefLexOR combines preference optimization with concepts from Reinforcement Learning to enable models to self-teach.
We focus on applications in biological materials science and demonstrate the method in a variety of case studies.
arXiv Detail & Related papers (2024-10-16T08:46:26Z) - Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs [63.36637269634553]
We introduce a novel approach where LLMs are fine-tuned to generate a sequence of Diverse Chains of Thought (DCoT) within a single inference step.<n>We show that fine-tuning on DCoT improves performance over the CoT baseline across model families and scales.<n>Our work is also significant because both quantitative analyses and manual evaluations reveal the observed gains stem from the models' ability to refine an initial reasoning chain.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.