Olica: Efficient Structured Pruning of Large Language Models without Retraining
- URL: http://arxiv.org/abs/2506.08436v1
- Date: Tue, 10 Jun 2025 04:19:38 GMT
- Title: Olica: Efficient Structured Pruning of Large Language Models without Retraining
- Authors: Jiujun He, Huazhen Lin
- Abstract summary: Existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish corrupted correlations. We propose a pruning framework for LLMs called Orthogonal decomposition and Linear Calibration (Olica). The proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish the corrupted correlations, making them prohibitively expensive. To address this, we propose a pruning framework for LLMs called Orthogonal decomposition and Linear Calibration (Olica), which eliminates the need for retraining. A key observation is that the multi-head attention (MHA) layer depends on two types of matrix products. By treating these matrix products as unified entities and applying principal component analysis (PCA), we extract the most important information to compress LLMs without sacrificing accuracy or disrupting their original structure. Consequently, retraining becomes unnecessary. A fast decomposition method is devised, reducing the complexity of PCA by a factor of the square of the number of attention heads. Additionally, to mitigate the error accumulation problem caused by pruning the feed-forward network (FFN) layer, we introduce a linear calibration method to reconstruct the residual errors of pruned layers using low-rank matrices. By leveraging singular value decomposition (SVD) on the solution of the least-squares problem, these matrices are obtained without requiring retraining. Extensive experiments show that the proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks.
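The abstract describes two retraining-free components: PCA applied to a fused MHA matrix product, and a linear calibration step that fits a low-rank correction by least squares and truncates it with SVD. Below is a minimal NumPy sketch of both mechanics, under assumed toy shapes and hypothetical function names; it illustrates the ideas above, not the authors' implementation, and in particular omits the fast per-head decomposition that reduces the PCA cost.

```python
# Minimal sketch of the two ideas in the Olica abstract (shapes and names
# are illustrative assumptions, not the authors' code).
import numpy as np

def pca_compress_fused(Wq, Wk, rank):
    """Treat the product Wq @ Wk.T as one entity and compress it with a
    truncated SVD (PCA); the thin factors replace the original pair."""
    fused = Wq @ Wk.T
    U, s, Vt = np.linalg.svd(fused, full_matrices=False)
    A = U[:, :rank] * s[:rank]        # (d, rank)
    B = Vt[:rank, :]                  # (rank, d)
    return A, B                       # A @ B approximates Wq @ Wk.T

def linear_calibration(X, residual, rank):
    """Fit a low-rank linear map reconstructing a pruned layer's residual
    error: solve min_W ||X @ W - residual||_F, then truncate W via SVD."""
    W, *_ = np.linalg.lstsq(X, residual, rcond=None)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank, :]   # low-rank factors of W

# Toy usage on random data.
rng = np.random.default_rng(0)
Wq = rng.standard_normal((512, 64))        # query projection (illustrative)
Wk = rng.standard_normal((512, 64))        # key projection
A, B = pca_compress_fused(Wq, Wk, rank=32)

X = rng.standard_normal((256, 512))        # calibration activations
residual = rng.standard_normal((256, 512)) # error introduced by pruning
L, R = linear_calibration(X, residual, rank=16)
correction = X @ L @ R                     # cheap low-rank error repair
```

Because both steps reduce to closed-form linear algebra (an SVD and a least-squares solve), no gradient-based retraining is involved, which is consistent with the efficiency claims in the abstract.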
Related papers
- Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study [64.26593350748401]
Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs). We propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training.
arXiv Detail & Related papers (2025-07-28T11:57:52Z) - Two-Stage Regularization-Based Structured Pruning for LLMs [32.65416603453818]
TRSP: Two-Stage Regularization-Based Structured Pruning for Large Language Models. We show that TRSP outperforms strong layer-wise structured pruning methods without requiring retraining. As a layer-wise pruning method, it delivers notable end-to-end acceleration.
arXiv Detail & Related papers (2025-05-23T12:40:59Z) - Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD [21.92418810749819]
We introduce a new explicit factorization method, Banded Inverse Square Root (BISR), which imposes a banded structure on the inverse correlation matrix. BISR achieves an asymptotically optimal error by matching the upper and lower bounds.
arXiv Detail & Related papers (2025-05-17T19:41:44Z) - Lightweight and Post-Training Structured Pruning for On-Device Large Language Models [11.93284417365518]
We introduce COMP, a lightweight post-training structured pruning method that employs a hybrid-granularity pruning strategy. COMP improves performance by 6.13% on the LLaMA-2-7B model with a 20% pruning ratio compared to LLM-Pruner.
arXiv Detail & Related papers (2025-01-25T16:03:58Z) - Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations.
Post-training pruning methods are proposed to prune LLMs in one-shot without retraining.
Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z) - Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining [16.026565606764954]
We simplify the pruning process for Transformer-based large language models (LLMs).
We propose two inference-aware pruning criteria derived from the optimization perspective of output approximation.
We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining.
arXiv Detail & Related papers (2024-07-26T23:53:59Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for Large Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - Fluctuation-based Adaptive Structured Pruning for Large Language Models [44.217363567065]
FLAP (FLuctuation-based Adaptive Structured Pruning) is a retraining-free structured pruning framework for Large Language Models.
It is hardware-friendly by effectively reducing storage and enhancing inference speed.
arXiv Detail & Related papers (2023-12-19T09:23:48Z) - An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute the covariance matrices involved.
arXiv Detail & Related papers (2023-09-30T15:57:14Z) - Large-scale gradient-based training of Mixtures of Factor Analyzers [67.21722742907981]
This article contributes both a theoretical analysis and a new method for efficient high-dimensional training by gradient descent.
We prove that MFA training and inference/sampling can be performed based on precision matrices, which does not require matrix inversions after training is completed.
Besides the theoretical analysis, we apply MFA to typical image datasets such as SVHN and MNIST, and demonstrate the ability to perform sample generation and outlier detection.
arXiv Detail & Related papers (2023-08-26T06:12:33Z) - Solving weakly supervised regression problem using low-rank manifold regularization [77.34726150561087]
We solve a weakly supervised regression problem.
Under "weakly" we understand that for some training points the labels are known, for some unknown, and for others uncertain due to the presence of random noise or other reasons such as lack of resources.
In the numerical section, we applied the suggested method to artificial and real datasets using Monte-Carlo modeling.
arXiv Detail & Related papers (2021-04-13T23:21:01Z) - Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z) - Recovery of Linear Components: Reduced Complexity Autoencoder Designs [0.951828574518325]
We present an approach called Recovery of Linear Components (RLC), which serves as a middle ground between linear and non-linear dimensionality reduction techniques.
With the aid of synthetic and real-world case studies, we show that RLC, when compared with an autoencoder of similar complexity, achieves higher accuracy, similar robustness to overfitting, and faster training times.
arXiv Detail & Related papers (2020-12-14T14:08:20Z)