Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
- URL: http://arxiv.org/abs/2509.12464v1
- Date: Mon, 15 Sep 2025 21:19:13 GMT
- Title: Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
- Authors: Ryan Lucas, Kayhan Behdin, Zhipeng Wang, Qingquan Song, Shao Tang, Rahul Mazumder
- Abstract summary: Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces.
- Score: 22.464365371107714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference, which makes them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can even make the model slower, since pruning causes the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC
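The core idea in the abstract — prune against calibration activations drawn from both the input and the model's own chain-of-thought traces, then reconstruct the surviving weights — can be illustrated with a toy sketch. Everything below is a simplified stand-in, not the paper's implementation: random matrices play the role of calibration activations, global magnitude pruning replaces the SparseGPT solver, and the reconstruction step is a plain per-column least-squares refit.

```python
import numpy as np

def prune_and_reconstruct(W, X, sparsity=0.5):
    """Magnitude-prune W to the target sparsity, then refit the
    surviving weights in each output column by least squares so that
    X @ W_pruned approximates the dense outputs X @ W."""
    Y = X @ W                                  # dense-layer outputs to match
    W_pruned = W.copy()
    # zero out the smallest-magnitude weights globally
    k = int(sparsity * W.size)
    thresh = np.partition(np.abs(W).ravel(), k)[k]
    mask = np.abs(W) >= thresh
    W_pruned[~mask] = 0.0
    # least-squares refit of the surviving weights, column by column
    for j in range(W.shape[1]):
        support = mask[:, j]
        if support.any():
            sol, *_ = np.linalg.lstsq(X[:, support], Y[:, j], rcond=None)
            W_pruned[support, j] = sol
    return W_pruned, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))
# stand-ins for calibration activations: ordinary "input" tokens plus
# on-policy chain-of-thought tokens sampled from the model itself
X_input = rng.normal(size=(128, 64))
X_cot = rng.normal(size=(128, 64))
X_joint = np.vstack([X_input, X_cot])          # RAC idea: reconstruct on both
W_rac, mask = prune_and_reconstruct(W, X_joint, sparsity=0.5)
```

The refit is guaranteed to do no worse than plain magnitude pruning on the calibration set, since the pruned weights themselves are a feasible point of the least-squares problem; the paper's contribution is the choice of calibration data (joint input plus on-policy traces), not the solver sketched here.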
Related papers
- Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog [72.4168434368873]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources. We propose a gradual compacting method that divides the compression process into multiple fine-grained iterations. This iterative approach, reminiscent of the "boiling frog" effect, enables the model to be progressively compressed without abrupt performance loss.
arXiv Detail & Related papers (2026-02-04T06:56:52Z) - From LLMs to LRMs: Rethinking Pruning for Reasoning-Centric Models [17.998434546981738]
Large language models (LLMs) are increasingly costly to deploy, motivating extensive research on model pruning. We conduct a controlled study of pruning for both instruction-following (LLM-instruct) and reasoning-augmented (LLM-think) models. We evaluate static depth pruning, static width pruning, and dynamic pruning across 17 tasks spanning classification, generation, and reasoning.
arXiv Detail & Related papers (2026-01-26T03:01:39Z) - Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning [61.380634253724594]
Large-scale autoregressive models are pretrained on next-token prediction and finetuned with reinforcement learning (RL). We show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model.
arXiv Detail & Related papers (2025-12-23T18:51:50Z) - C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression [4.10373648742522]
Pruning is a widely used technique that promotes sparsity in model structures. We propose a novel one-shot pruning framework that relies on explainable deep learning. Our method consistently achieves substantial reductions in model size, with minimal impact on performance, and without the need for fine-tuning.
arXiv Detail & Related papers (2025-10-21T13:40:11Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [60.37610817226533]
Chain-of-thought (CoT) reasoning encourages step-by-step intermediate reasoning during inference. CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. We present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs [15.23174472320989]
Large Language Models (LLMs) are central to many contemporary AI applications. Recent works in eXplainable AI (XAI) suggest that interpretability can also enable model compression.
arXiv Detail & Related papers (2025-06-16T17:38:36Z) - RL-Pruner: Structured Pruning Using Reinforcement Learning for CNN Compression and Acceleration [0.0]
We propose RL-Pruner, which uses reinforcement learning to learn the optimal pruning distribution.
RL-Pruner can automatically extract dependencies between filters in the input model and perform pruning, without requiring model-specific pruning implementations.
arXiv Detail & Related papers (2024-11-10T13:35:10Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z) - A Fair Loss Function for Network Pruning [70.35230425589592]
We introduce the performance weighted loss function, a simple modified cross-entropy loss function that can be used to limit the introduction of biases during pruning.
Experiments using the CelebA, Fitzpatrick17k and CIFAR-10 datasets demonstrate that the proposed method is a simple and effective tool.
arXiv Detail & Related papers (2022-11-18T15:17:28Z) - Interpretations Steered Network Pruning via Amortized Inferred Saliency Maps [85.49020931411825]
Convolutional Neural Networks (CNNs) compression is crucial to deploying these models in edge devices with limited resources.
We propose to address the channel pruning problem from a novel perspective by leveraging the interpretations of a model to steer the pruning process.
We tackle this challenge by introducing a selector model that predicts real-time smooth saliency masks for pruned models.
arXiv Detail & Related papers (2022-09-07T01:12:11Z) - CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
arXiv Detail & Related papers (2022-07-28T16:13:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.