RCPU: Rotation-Constrained Error Compensation for Structured Pruning of a Large Language Model
- URL: http://arxiv.org/abs/2510.07782v1
- Date: Thu, 09 Oct 2025 04:54:09 GMT
- Title: RCPU: Rotation-Constrained Error Compensation for Structured Pruning of a Large Language Model
- Authors: Shuichiro Haruta, Kazunori Matsumoto, Zhi Li, Yanan Wang, Mori Kurokawa
- Abstract summary: We propose a rotation-constrained compensation method to address the errors introduced by structured pruning of large language models (LLMs). LLMs are trained on massive datasets and accumulate rich semantic knowledge in their representation space. In experiments, we apply the proposed method to LLaMA-7B and evaluate it on WikiText-2 and multiple language understanding benchmarks.
- Score: 4.7618979083425215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a rotation-constrained compensation method to address the errors introduced by structured pruning of large language models (LLMs). LLMs are trained on massive datasets and accumulate rich semantic knowledge in their representation space. In contrast, pruning is typically carried out with only a small amount of calibration data, which makes output mismatches unavoidable. Although direct least-squares fitting can reduce such errors, it tends to overfit to the limited calibration set, destructively modifying the pretrained weights. To overcome this difficulty, we update the pruned parameters under a rotation constraint. This constrained update preserves the geometry of the output representations (i.e., norms and inner products) while re-aligning the pruned subspace with the original outputs. Furthermore, in rotation-constrained compensation, removing components that strongly contribute to the principal directions of the output makes error recovery difficult. Since input dimensions with large variance strongly affect these principal directions, we design a variance-aware importance score that ensures such dimensions are preferentially kept in the pruned model. By combining this scoring rule with rotation-constrained updates, the proposed method effectively compensates for pruning errors while retaining the components that are likely to matter most, in a geometry-preserving manner. In the experiments, we apply the proposed method to LLaMA-7B and evaluate it on WikiText-2 and multiple language understanding benchmarks. The results demonstrate consistently better perplexity and task accuracy compared with existing baselines.
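The abstract describes two ingredients without giving their exact formulation: a rotation-constrained update of the pruned parameters and a variance-aware importance score for choosing which input dimensions to keep. The sketch below illustrates one plausible reading of each in plain numpy: the rotation constraint is modeled as an orthogonal Procrustes problem solved by SVD, and the score weights each input dimension by its calibration variance and the norm of its weight row. Function names and the exact scoring formula are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def variance_aware_scores(X, W):
    """Illustrative importance score (not the paper's exact rule): weight each
    input dimension by its variance over the calibration set times the norm
    of its corresponding weight row."""
    var = X.var(axis=0)                    # variance of each input dimension
    row_norm = np.linalg.norm(W, axis=1)   # ||W[j, :]|| for input dimension j
    return var * row_norm

def rotation_constrained_update(Y_pruned, Y_orig):
    """Re-align the pruned outputs with the original outputs using a rotation
    only (orthogonal Procrustes): R = argmin_{R: R^T R = I} ||Y_pruned R - Y_orig||_F.
    Solved in closed form via the SVD of Y_pruned^T Y_orig."""
    U, _, Vt = np.linalg.svd(Y_pruned.T @ Y_orig)
    return U @ Vt

# Toy calibration batch: n tokens, d_in input dims, d_out output dims.
rng = np.random.default_rng(0)
n, d_in, d_out, n_keep = 256, 64, 32, 48
X = rng.normal(size=(n, d_in)) * rng.uniform(0.5, 2.0, size=d_in)  # uneven variances
W = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)

# Keep the input dimensions with the largest variance-aware scores.
idx = np.sort(np.argsort(variance_aware_scores(X, W))[-n_keep:])
Y_orig = X @ W
Y_pruned = X[:, idx] @ W[idx, :]

# Compensate the pruning error with a rotation of the pruned outputs.
R = rotation_constrained_update(Y_pruned, Y_orig)
err_before = np.linalg.norm(Y_pruned - Y_orig)
err_after = np.linalg.norm(Y_pruned @ R - Y_orig)
print(f"output error: {err_before:.3f} -> {err_after:.3f}")
```

Because R is orthogonal, Y_pruned @ R has the same row norms and pairwise inner products as Y_pruned, which is the geometry-preserving property the abstract emphasizes; the identity matrix is always a feasible choice, so the Procrustes solution can only reduce the output error on the calibration set.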
Related papers
- POLARIS: Projection-Orthogonal Least Squares for Robust and Adaptive Inversion in Diffusion Models [9.779983925649857]
The inversion-denoising paradigm, which is based on diffusion models, excels in diverse image editing and restoration tasks. We revisit its mechanism and reveal a critical, overlooked factor in reconstruction degradation: the approximate noise error. We introduce Projection-Orthogonal Least Squares for Robust and Adaptive Inversion (POLARIS), which reformulates inversion from an error-compensation problem into an error-origin problem.
arXiv Detail & Related papers (2025-11-29T07:35:20Z)
- Olica: Efficient Structured Pruning of Large Language Models without Retraining [0.1534667887016089]
Existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish corrupted correlations. We propose a pruning framework for LLMs called Orthogonal decomposition and Linear decomposition (Olica). The proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks.
arXiv Detail & Related papers (2025-06-10T04:19:38Z)
- FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression [15.784158079414235]
FLAT-LLM is a training-free structural compression method based on fine-grained low-rank transformations in the activation space. It achieves efficient and effective weight compression without recovery fine-tuning and can complete calibration within a few minutes.
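The summary does not spell out FLAT-LLM's transformation; as a rough, hedged sketch of the general idea of compressing in the activation space, one can take a truncated SVD of calibration activations and fold the resulting projection into the following weights so that no recovery fine-tuning is needed. The function below is hypothetical and only meant to convey the flavor of activation-space low-rank compression.

```python
import numpy as np

def low_rank_activation_compress(Y, W_next, rank):
    """Toy activation-space compression (not FLAT-LLM's actual procedure):
    project activations onto their top principal directions and fold the
    projection into the next layer's weights."""
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    P = Vt[:rank]              # (rank, d) orthonormal projection basis
    W_folded = P @ W_next      # absorb the projection into the next layer
    return P, W_folded         # y @ W_next  ≈  (y @ P.T) @ W_folded

rng = np.random.default_rng(0)
# Calibration activations with low intrinsic dimensionality (rank ~24).
Y = rng.normal(size=(512, 24)) @ rng.normal(size=(24, 128))
W_next = rng.normal(size=(128, 64)) / np.sqrt(128)

P, W_folded = low_rank_activation_compress(Y, W_next, rank=32)
approx = (Y @ P.T) @ W_folded
rel_err = np.linalg.norm(approx - Y @ W_next) / np.linalg.norm(Y @ W_next)
print(f"relative output error at rank 32: {rel_err:.2e}")
```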
arXiv Detail & Related papers (2025-05-29T19:42:35Z)
- RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE). RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
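RoSTE's adaptive rotation and QA-SFT loop are not detailed here, but the underlying identity such rotation schemes exploit is simple: for any orthogonal R, W x = (W R)(Rᵀ x), so weights and activations can be rotated into a basis where outliers are spread out before quantizing. The sketch below uses a random orthogonal rotation in place of RoSTE's optimized one, with a naive symmetric quantizer, purely as an illustration.

```python
import numpy as np

def quantize_sym(x, bits=4):
    """Naive symmetric per-tensor quantization (illustration only)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
d = 256
x = rng.normal(size=d)
x[:4] *= 30.0                                   # a few activation outliers
W = rng.normal(size=(d, d)) / np.sqrt(d)

# Random orthogonal rotation; RoSTE would optimize this instead.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))

y_ref = W @ x
y_plain = quantize_sym(W) @ quantize_sym(x)           # quantize in the original basis
y_rot = quantize_sym(W @ R) @ quantize_sym(R.T @ x)   # same product, rotated basis
print(np.linalg.norm(y_plain - y_ref), np.linalg.norm(y_rot - y_ref))
```

After the rotation the outliers are spread across all dimensions, so the quantization step for the activations shrinks and the output error typically drops, even though the exact product W x is unchanged in full precision.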
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
- Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization [18.24882084542254]
We present an array of reconstruction techniques that can significantly reduce this error by more than 90%.
We find that self-generating calibration data can mitigate the trade-off between reconstruction and generalization.
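For contrast with the rotation-constrained update in the main paper, the unconstrained reconstruction studied in this line of work can be written as a plain least-squares refit of the remaining weights against the dense layer's outputs on calibration data. This is a generic sketch of reconstruction error minimization, not this paper's specific pipeline or its self-generated calibration strategy.

```python
import numpy as np

def least_squares_reconstruct(X_kept, Y_target):
    """Refit the pruned layer so its outputs match the dense layer's outputs
    on the calibration data: W* = argmin_W ||X_kept W - Y_target||_F.
    Unlike a rotation, this update is unconstrained and can overfit
    when the calibration set is small."""
    W_star, *_ = np.linalg.lstsq(X_kept, Y_target, rcond=None)
    return W_star

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))                 # calibration activations
W = rng.normal(size=(64, 32)) / 8.0            # dense weights
kept = np.arange(48)                           # input dims surviving pruning
W_fit = least_squares_reconstruct(X[:, kept], X @ W)
rel_err = np.linalg.norm(X[:, kept] @ W_fit - X @ W) / np.linalg.norm(X @ W)
print(f"relative reconstruction error on calibration data: {rel_err:.3f}")
```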
arXiv Detail & Related papers (2024-06-21T05:13:34Z)
- Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
- Revisiting Rotation Averaging: Uncertainties and Robust Losses [51.64986160468128]
We argue that the main problem of current methods is the minimized cost function, which is only weakly connected with the input data via the estimated epipolar geometries.
We propose to better model the underlying noise distributions by directly propagating the uncertainty from the point correspondences into the rotation averaging.
arXiv Detail & Related papers (2023-03-09T11:51:20Z)
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models [57.933500846742234]
Recent work recognizes that structured outliers are the critical bottleneck for quantization performance.
We propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping.
This framework effectively suppresses the outliers and can be used in a plug-and-play mode.
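The two components are only named here; as a hedged illustration of the clipping side of the idea, the snippet below clips activations at a high quantile before symmetric quantization so that a few outliers no longer dictate the quantization step, at the cost of some error on the clipped values. The coarse-to-fine clipping search and Gamma Migration of the actual method are not modeled.

```python
import numpy as np

def clip_then_quantize(acts, bits=8, q=1.0):
    """Clip activations at the q-quantile of |acts|, then apply naive
    symmetric quantization. q=1.0 means no clipping."""
    clip = np.quantile(np.abs(acts), q)
    scale = clip / (2 ** (bits - 1) - 1)
    return np.round(np.clip(acts, -clip, clip) / scale) * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=(1024, 768))            # well-behaved tokens
acts[rng.integers(0, 1024, size=5)] *= 50.0    # a few outlier tokens
bulk = np.abs(acts) < np.quantile(np.abs(acts), 0.99)  # typical (non-outlier) values
for q in (1.0, 0.999):
    err = np.abs(clip_then_quantize(acts, q=q) - acts)[bulk].mean()
    print(f"q={q}: mean quantization error on typical values = {err:.4f}")
```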
arXiv Detail & Related papers (2022-09-27T12:05:59Z)
- Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction [49.25830718574892]
We present a new framework named Tail-to-Tail (TtT) non-autoregressive sequence prediction.
Most tokens are correct and can be conveyed directly from source to target, while the error positions can be estimated and corrected.
Experimental results on standard datasets, especially on the variable-length datasets, demonstrate the effectiveness of TtT in terms of sentence-level Accuracy, Precision, Recall, and F1-Measure.
arXiv Detail & Related papers (2021-06-03T05:56:57Z)
- MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models [78.45898846056303]
Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models.
We develop a novel MultiLevel structured Pruning framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning.
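Head, row, and block-wise pruning are only named here; as a minimal sketch of the simplest level, row pruning, one can rank the output rows of a linear layer by a crude importance proxy (the L2 norm below, purely illustrative) and drop the lowest-scoring rows together with the matching input columns of the next layer.

```python
import numpy as np

def prune_rows(W, b, keep_ratio=0.5):
    """Structured row pruning (illustrative): drop the output rows of a
    linear layer with the smallest L2 norm. The matching input columns of
    the following layer must be dropped as well."""
    n_keep = int(W.shape[0] * keep_ratio)
    keep = np.sort(np.argsort(np.linalg.norm(W, axis=1))[-n_keep:])
    return W[keep], b[keep], keep

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(256, 128)), rng.normal(size=256)   # layer 1: 128 -> 256
W2 = rng.normal(size=(64, 256))                              # layer 2: 256 -> 64
W1_p, b1_p, keep = prune_rows(W1, b1, keep_ratio=0.5)
W2_p = W2[:, keep]                                           # keep matching columns
x = rng.normal(size=128)
y = W2_p @ np.maximum(W1_p @ x + b1_p, 0.0)                  # pruned forward pass
print(W1.shape, "->", W1_p.shape, "; output:", y.shape)
```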
arXiv Detail & Related papers (2021-05-30T22:00:44Z)