A Free Lunch in LLM Compression: Revisiting Retraining after Pruning
- URL: http://arxiv.org/abs/2510.14444v1
- Date: Thu, 16 Oct 2025 08:43:09 GMT
- Title: A Free Lunch in LLM Compression: Revisiting Retraining after Pruning
- Authors: Moritz Wagner, Christophe Roux, Max Zimmer, Sebastian Pokutta
- Abstract summary: We study the key design choices when reconstructing or retraining the remaining weights after pruning. In particular, we observe a free lunch scenario: reconstructing attention and MLP components separately within each transformer block is nearly the most resource-efficient yet achieves the best perplexity. Our findings challenge the narrative that retraining should be avoided at all costs and provide important insights into post-pruning performance recovery.
- Score: 23.87950717135044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Neural Network pruning typically requires retraining the model to recover pruning-induced performance degradation, state-of-the-art Large Language Model (LLM) pruning methods instead solve a layer-wise mask selection and reconstruction problem on a small set of calibration data to avoid full retraining, as it is considered computationally infeasible for LLMs. Reconstructing single matrices in isolation has favorable properties, such as convexity of the objective and significantly reduced memory requirements compared to full retraining. In practice, however, reconstruction is often implemented at coarser granularities, e.g., reconstructing a whole transformer block against its dense activations instead of a single matrix. In this work, we study the key design choices when reconstructing or retraining the remaining weights after pruning. We conduct an extensive computational study on state-of-the-art GPT architectures, and report several surprising findings that challenge common intuitions about retraining after pruning. In particular, we observe a free lunch scenario: reconstructing attention and MLP components separately within each transformer block is nearly the most resource-efficient yet achieves the best perplexity. Most importantly, this Pareto-optimal setup achieves better performance than full retraining, despite requiring only a fraction of the memory. Furthermore, we demonstrate that simple and efficient pruning criteria such as Wanda can outperform much more complex approaches when the reconstruction step is properly executed, highlighting its importance. Our findings challenge the narrative that retraining should be avoided at all costs and provide important insights into post-pruning performance recovery for LLMs.
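To make the two ingredients in the abstract concrete, the sketch below shows (i) a Wanda-style pruning score, |W_ij| * ||X_j||_2 computed from calibration activations, and (ii) a least-squares refit of the surviving weights against the dense layer's outputs on that calibration data. This is a minimal PyTorch sketch of the general recipe, not the authors' implementation; the tensor shapes, function names, and the assumption that calibration activations fit in memory are illustrative.

```python
import torch

def wanda_mask(W: torch.Tensor, X: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Wanda-style mask: score_ij = |W_ij| * ||X_j||_2, pruned per output row.

    W: (d_out, d_in) weight matrix, X: (n_tokens, d_in) calibration activations.
    """
    score = W.abs() * X.norm(p=2, dim=0)            # broadcast column norms over rows
    k = int(W.shape[1] * sparsity)                  # weights to drop per row
    idx = torch.topk(score, k, dim=1, largest=False).indices
    mask = torch.ones_like(W, dtype=torch.bool)
    mask.scatter_(1, idx, False)                    # False = pruned
    return mask

def reconstruct(W: torch.Tensor, mask: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """Refit the surviving weights so the sparse layer matches the dense outputs
    on the calibration activations, i.e. minimize ||X W'^T - X W^T||_F subject to
    W' being zero outside the mask. Solved row by row via least squares.
    """
    Y = X @ W.T                                     # dense targets
    W_new = torch.zeros_like(W)
    for i in range(W.shape[0]):
        cols = mask[i]                              # surviving input dims for row i
        sol = torch.linalg.lstsq(X[:, cols], Y[:, i:i+1]).solution.squeeze(1)
        W_new[i, cols] = sol
    return W_new
```

In the paper's terminology, the "free lunch" setting applies such a refit once to the attention component and once to the MLP component of each transformer block, rather than to every matrix in isolation or to the full model at once.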
Related papers
- Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog [72.4168434368873]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources. We propose a gradual compacting method that divides the compression process into multiple fine-grained iterations. This iterative approach, reminiscent of the "boiling frog" effect, enables the model to be progressively compressed without abrupt performance loss.
arXiv Detail & Related papers (2026-02-04T06:56:52Z)
- Don't Be Greedy, Just Relax! Pruning LLMs via Frank-Wolfe [61.68406997155879]
State-of-the-art Large Language Model (LLM) pruning methods operate layer-wise, minimizing the per-layer pruning error on a small dataset to avoid full retraining. Existing methods hence rely on greedy heuristics that ignore the weight interactions in the pruning objective. Our method drastically reduces the per-layer pruning error, outperforms strong baselines on state-of-the-art GPT architectures, and remains memory-efficient.
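For readers unfamiliar with the algorithm named in the title, the snippet below shows a generic Frank-Wolfe (conditional gradient) iteration on a least-squares objective over an L1-ball, where the linear minimization oracle touches a single coordinate per step and iterates remain sparse. This is a textbook illustration of Frank-Wolfe under assumed objective and constraint choices, not the paper's actual pruning formulation.

```python
import torch

def frank_wolfe_l1(grad_fn, d: int, radius: float, steps: int = 200) -> torch.Tensor:
    """Generic Frank-Wolfe over the L1-ball {x : ||x||_1 <= radius}.

    grad_fn(x) returns the gradient of a convex objective at x. The linear
    minimization oracle over the L1-ball is a signed vertex on the coordinate
    with the largest absolute gradient entry, so iterates stay sparse.
    """
    x = torch.zeros(d)
    for t in range(steps):
        g = grad_fn(x)
        i = torch.argmax(g.abs())                 # LMO: best vertex of the L1-ball
        s = torch.zeros(d)
        s[i] = -radius * torch.sign(g[i])
        gamma = 2.0 / (t + 2.0)                   # standard step-size schedule
        x = (1 - gamma) * x + gamma * s           # convex combination keeps feasibility
    return x

# Example: least-squares reconstruction objective 0.5 * ||A x - b||^2
A, b = torch.randn(64, 32), torch.randn(64)
x_hat = frank_wolfe_l1(lambda x: A.T @ (A @ x - b), d=32, radius=5.0)
```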
arXiv Detail & Related papers (2025-10-15T16:13:44Z)
- Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study [64.26593350748401]
Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs). We propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training.
arXiv Detail & Related papers (2025-07-28T11:57:52Z)
- Olica: Efficient Structured Pruning of Large Language Models without Retraining [0.1534667887016089]
Existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish corrupted correlations. We propose a pruning framework for LLMs called Orthogonal decomposition and Linear decomposition (Olica). The proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks.
arXiv Detail & Related papers (2025-06-10T04:19:38Z)
- Boosting All-in-One Image Restoration via Self-Improved Privilege Learning [72.35265021054471]
Self-Improved Privilege Learning (SIPL) is a novel paradigm that overcomes limitations by extending the utility of privileged information (PI) beyond training into the inference stage. Central to SIPL is Proxy Fusion, a lightweight module incorporating a learnable Privileged Dictionary. Extensive experiments demonstrate that SIPL significantly advances the state-of-the-art on diverse all-in-one image restoration benchmarks.
arXiv Detail & Related papers (2025-05-30T04:36:52Z) - Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining [16.026565606764954]
We simplify the pruning process for Transformer-based large language models (LLMs)
We propose two inference-aware pruning criteria derived from the optimization perspective of output approximation.
We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining.
arXiv Detail & Related papers (2024-07-26T23:53:59Z)
- A deeper look at depth pruning of LLMs [49.30061112976263]
Large Language Models (LLMs) are resource-intensive to train, but even more costly to deploy in production.
Recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance.
We show that adaptive metrics exhibit a trade-off in performance between tasks.
arXiv Detail & Related papers (2024-07-23T08:40:27Z)
- Reconstruct the Pruned Model without Any Retraining [23.235907813011174]
We introduce the Linear Interpolation-based Adaptive Reconstruction (LIAR) framework, which is both efficient and effective.
LIAR does not require back-propagation or retraining and is compatible with various pruning criteria and modules.
Our evaluations on benchmarks such as GLUE, SQuAD, WikiText, and common sense reasoning show that LIAR enables a BERT model to maintain 98% accuracy even after removing 50% of its parameters.
arXiv Detail & Related papers (2024-07-18T09:30:44Z)
- Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization [18.24882084542254]
We present an array of reconstruction techniques that can significantly reduce the reconstruction error by more than 90%.
We find that a strategy of self-generating calibration data can mitigate the trade-off between reconstruction and generalization.
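The self-generated calibration idea can be sketched in a few lines: rather than drawing calibration text from an external corpus, the dense model samples its own continuations, and those samples feed the layer-wise reconstruction objective. The model name, prompt, and sampling settings below are placeholders, and this is only an illustration of the strategy, not the paper's exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                    # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Let the dense model write its own calibration set instead of sampling a corpus.
prompt = tok("The", return_tensors="pt")
with torch.no_grad():
    samples = model.generate(**prompt, do_sample=True, max_new_tokens=128,
                             num_return_sequences=8, pad_token_id=tok.eos_token_id)
# `samples` (8 token sequences) would then replace externally sourced
# calibration data in the usual layer-wise reconstruction step.
```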
arXiv Detail & Related papers (2024-06-21T05:13:34Z)
- AdaIR: Exploiting Underlying Similarities of Image Restoration Tasks with Adapters [57.62742271140852]
AdaIR is a novel framework that enables low storage cost and efficient training without sacrificing performance.
AdaIR requires solely the training of lightweight, task-specific modules, ensuring a more efficient storage and training regimen.
arXiv Detail & Related papers (2024-04-17T15:31:06Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use the reward model's attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
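The redistribution step described above can be made concrete: the scalar reward is spread over the completion tokens in proportion to normalized attention weights taken from the reward model, so the per-token rewards still sum to the original scalar. The extraction and normalization of the attention weights below are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def densify_reward(scalar_reward: float, attn_weights: torch.Tensor) -> torch.Tensor:
    """Spread a sequence-level reward over tokens proportionally to attention mass.

    attn_weights: (seq_len,) nonnegative attention received by each completion
    token (e.g. averaged over the reward model's heads and layers). The dense
    rewards sum back to the original scalar.
    """
    weights = attn_weights.clamp(min=0)
    weights = weights / weights.sum()
    return scalar_reward * weights

# Example: a completion of 5 tokens and a scalar reward of 0.8
attn = torch.tensor([0.1, 0.3, 0.05, 0.4, 0.15])
print(densify_reward(0.8, attn))    # per-token rewards summing to 0.8
```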
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
- PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs [22.557682089926004]
We show that updating a small subset of parameters can suffice to recover or even enhance performance after pruning. We introduce two novel LoRA variants that, unlike standard LoRA, allow merging adapters back without compromising sparsity.
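The merging issue this summary refers to can be illustrated directly: naively merging a LoRA update as W + BA repopulates pruned positions and destroys sparsity, whereas restricting the merged result to the pruning mask keeps the sparsity pattern intact. The sketch below shows that masked merge as a minimal illustration of the general idea; it is an assumption for exposition, not the paper's specific LoRA variants.

```python
import torch

def merge_lora_preserving_sparsity(W: torch.Tensor, mask: torch.Tensor,
                                   B: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """Merge a low-rank update into a pruned weight without losing sparsity.

    W:    (d_out, d_in) pruned weight (zeros where mask is False)
    mask: (d_out, d_in) boolean sparsity pattern
    B, A: low-rank factors whose product B @ A has shape (d_out, d_in)

    A plain merge `W + B @ A` would fill in pruned entries; masking the
    merged result keeps the original sparsity pattern.
    """
    return (W + B @ A) * mask

# Example with 50% sparsity and a rank-4 adapter
d_out, d_in, r = 8, 16, 4
mask = torch.rand(d_out, d_in) > 0.5
W = torch.randn(d_out, d_in) * mask
B, A = torch.randn(d_out, r), torch.randn(r, d_in)
W_merged = merge_lora_preserving_sparsity(W, mask, B, A)
print((W_merged[~mask] == 0).all())   # pruned positions remain exactly zero
```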
arXiv Detail & Related papers (2023-12-23T11:45:22Z)