Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models
- URL: http://arxiv.org/abs/2512.02185v1
- Date: Mon, 01 Dec 2025 20:27:05 GMT
- Title: Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models
- Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang, Minwoo Lee, Shu-ping Yeh, Li Yang,
- Abstract summary: Reasoning LLMs (RLMs) deliver strong multi-step reasoning through chain-of-thought generation. RLMs' large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. We introduce RESP, a structured pruning framework that aligns pruning decisions with the model's reasoning dynamics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce compute and memory costs, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.
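To make the abstract's pipeline concrete, below is a minimal sketch of the two ingredients it names: self-generated calibration traces and a decode-only, gradient-based importance score. The module path (`mlp.down_proj`) assumes a Qwen3/LLaMA-style checkpoint, and the first-order Taylor estimator is a common structured-pruning choice assumed here for illustration; function names, trace length, and estimator details are not taken from the paper.

```python
# Hedged sketch of the recipe described in the abstract: calibrate on the
# model's own reasoning traces, then score structured groups using gradients
# computed only on decode-time tokens. The Taylor-style importance score is
# an assumption, not the paper's exact specification.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-8B"  # the model evaluated in the paper
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def self_generated_trace(question: str):
    """Step 1: let the model write its own chain of thought -- the calibration
    signal the paper argues for, instead of human-written labels."""
    ids = tok(question, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=512, do_sample=False)
    return out, ids.shape[1]  # full sequence + index where decoding began

def accumulate_decode_only_grads(question: str) -> None:
    """Step 2: next-token loss only on the self-generated decode tokens, so
    gradients reflect the model's decode-time reasoning distribution."""
    seq, prompt_len = self_generated_trace(question)
    labels = seq.clone()
    labels[:, :prompt_len] = -100  # mask the prompt: decode-only objective
    model(input_ids=seq, labels=labels).loss.backward()  # grads accumulate

def ffn_channel_importance(layer) -> torch.Tensor:
    """Step 3: first-order Taylor score |w * dL/dw| per FFN channel (call
    accumulate_decode_only_grads on the calibration set first)."""
    w = layer.mlp.down_proj.weight  # [hidden, intermediate] in Qwen3-style MLPs
    return (w * w.grad).abs().sum(dim=0)  # one score per intermediate channel
```

The abstract's third ingredient, progressive regeneration, would wrap these steps in a loop: after each pruning increment, the traces are regenerated by the current, partially pruned model, so the calibration set keeps matching its inference distribution as sparsity grows.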
Related papers
- ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference [60.958331943869126]
ODAR-Expert is an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. We show strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam.
arXiv Detail & Related papers (2026-02-27T05:22:01Z) - Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [108.26461635308796]
We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models. We further introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training.
arXiv Detail & Related papers (2026-02-04T15:24:52Z) - Bayesian-LoRA: Probabilistic Low-Rank Adaptation of Large Language Models [5.653755499165773]
We introduce Bayesian-LoRA, which reformulates the deterministic LoRA update as a probabilistic low-rank representation inspired by Sparse Gaussian Processes. With only approximately 0.42M additional parameters and $\approx 1.2\times$ training cost relative to standard LoRA, Bayesian-LoRA significantly improves calibration across models up to 30B.
arXiv Detail & Related papers (2026-01-28T19:54:31Z) - LLMs can Compress LLMs: Adaptive Pruning by Agents [0.0]
Post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent. We evaluate our approach on Q3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines.
arXiv Detail & Related papers (2026-01-14T18:45:36Z) - EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs [9.412828452977553]
Existing approaches reinforce successful reasoning paths, incurring a substantial calibration cost. This failure has been characterized as a form of model collapse in alignment. We propose EpiCaR, a training objective that jointly optimizes reasoning performance and calibration.
arXiv Detail & Related papers (2026-01-11T06:21:13Z) - Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models [48.973207827896]
We show that using self-generated reasoning data for calibration can substantially improve pruning performance. Our analysis reveals that challenging and moderately long self-generated reasoning data serve as ideal calibration data.
arXiv Detail & Related papers (2025-11-24T08:08:19Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank [71.09032766271493]
Large language models (LLMs) are prone to errors and hallucinations. Checking their outputs effectively and efficiently has therefore become a critical problem in practice.
arXiv Detail & Related papers (2025-10-28T11:01:10Z) - Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression [30.653381666162275]
Certainty-Guided Reflection Suppression (CGRS) is a novel method that mitigates overthinking in Large Reasoning Language Models (LRLMs). CGRS operates by dynamically suppressing the model's generation of reflection triggers when it exhibits high confidence in its current response (a minimal sketch of this mechanism appears after this list). Our approach is model-agnostic, requires no retraining or architectural modifications, and can be integrated seamlessly with existing autoregressive generation pipelines.
arXiv Detail & Related papers (2025-08-07T12:38:22Z) - Boosting LLM Reasoning via Spontaneous Self-Correction [43.4980625253775]
Self-correction is one approach to improving math reasoning. Existing self-correction approaches treat corrections as standalone post-generation refinements. We propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass.
arXiv Detail & Related papers (2025-06-07T21:23:00Z) - SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.931194824519935]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance. We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z) - Self-Data Distillation for Recovering Quality in Pruned Large Language Models [1.5665059604715017]
One-shot pruning results in significant quality degradation, particularly in tasks requiring multi-step reasoning. To recover lost quality, supervised fine-tuning (SFT) is commonly applied, but it can lead to catastrophic forgetting. In this work, we utilize self-data distilled fine-tuning to address these challenges.
arXiv Detail & Related papers (2024-10-13T19:53:40Z)
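As a concrete illustration of the certainty-guided reflection suppression idea summarized in the CGRS entry above, here is a hedged sketch written as a HuggingFace `LogitsProcessor`: when the model's next-token confidence is high, tokens that typically open a re-think ("Wait", "Hmm", ...) are banned. The trigger list and the 0.9 threshold are illustrative assumptions, and the paper's actual certainty criterion may differ.

```python
# Hedged sketch of certainty-guided reflection suppression (see CGRS entry):
# if the model is already confident, suppress tokens that start a reflection.
# Trigger ids and threshold are illustrative, not from the paper.
import torch
from transformers import LogitsProcessor

class ReflectionSuppressor(LogitsProcessor):
    def __init__(self, trigger_ids: list[int], threshold: float = 0.9):
        self.trigger_ids = trigger_ids  # token ids of e.g. "Wait", "Hmm", "But"
        self.threshold = threshold      # max-prob needed to count as "certain"

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        probs = torch.softmax(scores, dim=-1)
        certain = probs.max(dim=-1).values > self.threshold  # per batch row
        for row in torch.nonzero(certain).flatten().tolist():
            scores[row, self.trigger_ids] = float("-inf")  # ban reflection triggers
        return scores
```

Wrapped in a `LogitsProcessorList` and passed to `model.generate(..., logits_processor=...)`, this shortens reasoning traces at decode time with no retraining or architectural change, consistent with the entry's model-agnostic claim.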