Related papers: Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models

Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models

URL: http://arxiv.org/abs/2505.17974v1
Date: Fri, 23 May 2025 14:41:52 GMT
Title: Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models
Authors: Viktoriia Chekalina, Daniil Moskovskiy, Daria Cherniuk, Maxim Kurkin, Andrey Kuznetsov, Evgeny Frolov,
Abstract summary: Generalized Fisher-Weighted SVD (GFWSVD) is a post-training compression technique that accounts for both diagonal and off-diagonal elements of the Fisher information matrix.<n>We demonstrate the effectiveness of our method on LLM compression, showing improvements over existing compression baselines.
Score: 6.57101653042078
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Fisher information is a fundamental concept for characterizing the sensitivity of parameters in neural networks. However, leveraging the full observed Fisher information is too expensive for large models, so most methods rely on simple diagonal approximations. While efficient, this approach ignores parameter correlations, often resulting in reduced performance on downstream tasks. In this work, we mitigate these limitations and propose Generalized Fisher-Weighted SVD (GFWSVD), a post-training LLM compression technique that accounts for both diagonal and off-diagonal elements of the Fisher information matrix, providing a more accurate reflection of parameter importance. To make the method tractable, we introduce a scalable adaptation of the Kronecker-factored approximation algorithm for the observed Fisher information. We demonstrate the effectiveness of our method on LLM compression, showing improvements over existing compression baselines. For example, at a 20 compression rate on the MMLU benchmark, our method outperforms FWSVD, which is based on a diagonal approximation of the Fisher information, by 5 percent, SVD-LLM by 3 percent, and ASVD by 6 percent compression rate.

Related papers

Efficient Model Compression Techniques with FishLeg [30.69238973086908]
FishLeg is a new second-order pruning method based on the Fisher-Legendre (FishLeg)<n>At the heart of FishLeg is a meta-learning approach to amortising the action of the inverse FIM.<n>We find that FishLeg achieves higher or comparable performance against two common baselines in the area.
arXiv Detail & Related papers (2024-12-03T09:42:16Z)
FisherMask: Enhancing Neural Network Labeling Efficiency in Image Classification Using Fisher Information [2.762397703396293]
FisherMask is a Fisher information-based active learning (AL) approach that identifies key network parameters by masking them. Our experiments demonstrate that FisherMask significantly outperforms state-of-the-art methods on diverse datasets.
arXiv Detail & Related papers (2024-11-08T18:10:46Z)
GIFT-SW: Gaussian noise Injected Fine-Tuning of Salient Weights for LLMs [51.02233412547456]
We introduce a novel PEFT method, called Gaussian noise Injected Fine Tuning of Salient Weights (GIFT-SW) Our method updates only salient columns, while injecting Gaussian noise into non-salient ones. Experiments with LLaMA models demonstrate that GIFT-SW outperforms full fine-tuning and modern PEFT methods under the same computational budget.
arXiv Detail & Related papers (2024-08-27T14:41:14Z)
Uncovering Model Processing Strategies with Non-Negative Per-Example Fisher Factorization [44.39409273712917]
We introduce NPEFF, an interpretability method that aims to uncover strategies used by a model to generate its predictions.<n>We demonstrate that NPEFF components correspond to model processing strategies for a variety of language models and text processing tasks.
arXiv Detail & Related papers (2023-10-07T02:02:45Z)
Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets. We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z)
Low-rank extended Kalman filtering for online learning of neural networks from streaming data [71.97861600347959]
We propose an efficient online approximate Bayesian inference algorithm for estimating the parameters of a nonlinear function from a potentially non-stationary data stream. The method is based on the extended Kalman filter (EKF), but uses a novel low-rank plus diagonal decomposition of the posterior matrix. In contrast to methods based on variational inference, our method is fully deterministic, and does not require step-size tuning.
arXiv Detail & Related papers (2023-05-31T03:48:49Z)
Numerical Optimizations for Weighted Low-rank Estimation on Language Model [73.12941276331316]
Singular value decomposition (SVD) is one of the most popular compression methods that approximates a target matrix with smaller matrices. Standard SVD treats the parameters within the matrix with equal importance, which is a simple but unrealistic assumption. We show that our method can perform better than current SOTA methods in neural-based language models.
arXiv Detail & Related papers (2022-11-02T00:58:02Z)
Language model compression with weighted low-rank factorization [73.61874728240568]
We introduce Fisher information to weigh the importance of parameters affecting the model prediction. We find that our resulting task accuracy is much closer to the original model's performance. Our method can directly compress a task-specific model while achieving better performance than other compact model strategies.
arXiv Detail & Related papers (2022-06-30T21:57:07Z)
Efficient Approximations of the Fisher Matrix in Neural Networks using Kronecker Product Singular Value Decomposition [0.0]
It is shown that natural gradient descent can minimize the objective function more efficiently than ordinary gradient descent based methods. The bottleneck of this approach for training deep neural networks lies in the prohibitive cost of solving a large dense linear system corresponding to the Fisher Information Matrix (FIM) at each iteration. This has motivated various approximations of either the exact FIM or the empirical one. The most sophisticated of these is KFAC, which involves a Kronecker-factored block diagonal approximation of the FIM. With only a slight additional cost, a few improvements of KFAC from the standpoint of accuracy are proposed
arXiv Detail & Related papers (2022-01-25T12:56:17Z)
Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilation convolution is a critical mutant of standard convolution neural network to control effective receptive fields and handle large scale variance of objects. We propose a new mutant of dilated convolution, namely inception (dilated) convolution where the convolutions have independent dilation among different axes, channels and layers. We explore a practical method for fitting the complex inception convolution to the data, a simple while effective dilation search algorithm(EDO) based on statistical optimization is developed.
arXiv Detail & Related papers (2020-12-25T14:58:35Z)
Two-Level K-FAC Preconditioning for Deep Learning [7.699428789159717]
In the context of deep learning, many optimization methods use gradient covariance information in order to accelerate the convergence of Gradient Descent. In particular, starting with Adagrad, a seemingly endless line of research advocates the use of diagonal approximations of the so-called empirical Fisher matrix. One particularly successful variant of such methods is the so-called K-FAC, which uses a Kronecker-ed block-factored preconditioner.
arXiv Detail & Related papers (2020-11-01T17:54:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.