Language model compression with weighted low-rank factorization
- URL: http://arxiv.org/abs/2207.00112v1
- Date: Thu, 30 Jun 2022 21:57:07 GMT
- Title: Language model compression with weighted low-rank factorization
- Authors: Yen-Chang Hsu, Ting Hua, Sungen Chang, Qian Lou, Yilin Shen, Hongxia
Jin
- Abstract summary: We introduce Fisher information to weigh the importance of parameters affecting the model prediction.
We find that our resulting task accuracy is much closer to the original model's performance.
Our method can directly compress a task-specific model while achieving better performance than other compact model strategies.
- Score: 73.61874728240568
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Factorizing a large matrix into small matrices is a popular strategy for
model compression. Singular value decomposition (SVD) plays a vital role in
this compression strategy, approximating a learned matrix with fewer
parameters. However, SVD minimizes the squared error toward reconstructing the
original matrix without gauging the importance of the parameters, potentially
incurring a larger reconstruction error on the parameters that affect task
accuracy the most. In other words, the optimization objective of SVD is not aligned with the
trained model's task accuracy. We analyze this previously unexplored problem,
make observations, and address it by introducing Fisher information to weigh
the importance of parameters affecting the model prediction. This idea leads to
our method: Fisher-Weighted SVD (FWSVD). Although the factorized matrices from
our approach do not result in smaller reconstruction errors, we find that our
resulting task accuracy is much closer to the original model's performance. We
perform analysis with transformer-based language models, showing that our
weighted SVD largely alleviates the mismatched optimization objectives and can
maintain model performance with a higher compression rate. Our method can
directly compress a task-specific model while achieving better performance than
other compact model strategies requiring expensive model pre-training.
Moreover, the evaluation of compressing an already compact model shows our
method can further reduce parameters by 9% to 30% with an insignificant impact on
task accuracy.
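To make the idea concrete, here is a minimal, self-contained PyTorch sketch of what a Fisher-weighted low-rank factorization can look like. It is not the authors' implementation: the per-batch gradient collection, the row-wise aggregation of the Fisher estimate, and the square-root scaling that turns the weighted objective into a plain SVD are assumptions chosen to keep the example short.

    # Hedged sketch of Fisher-weighted SVD (FWSVD-style), not the authors' code.
    # Assumptions: a diagonal Fisher estimate from squared gradients, row-wise
    # aggregation of that estimate, and sqrt scaling so the weighted least-squares
    # problem reduces to an ordinary truncated SVD.
    import torch

    def estimate_fisher(weight_grads):
        """Diagonal Fisher estimate: average of squared per-batch gradients.
        weight_grads: list of gradient tensors with the same shape as the weight."""
        fisher = torch.zeros_like(weight_grads[0])
        for g in weight_grads:
            fisher += g.pow(2)
        return fisher / len(weight_grads)

    def fisher_weighted_svd(W, fisher, rank):
        """Factor W (out x in) into A @ B, weighting the reconstruction error of
        each row by its accumulated Fisher importance."""
        # Aggregate importance per row so the weighted problem has a closed form.
        row_importance = fisher.sum(dim=1).clamp_min(1e-8)      # (out,)
        d = row_importance.sqrt()                               # sqrt for least squares
        # SVD of the row-scaled matrix, then undo the scaling on the left factor.
        U, S, Vh = torch.linalg.svd(torch.diag(d) @ W, full_matrices=False)
        U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]
        A = torch.diag(1.0 / d) @ U_r @ torch.diag(S_r)          # (out, rank)
        B = Vh_r                                                 # (rank, in)
        return A, B

    if __name__ == "__main__":
        torch.manual_seed(0)
        W = torch.randn(768, 768)
        # Fake per-batch gradients standing in for gradients of the task loss.
        grads = [torch.randn_like(W) for _ in range(8)]
        A, B = fisher_weighted_svd(W, estimate_fisher(grads), rank=64)
        print(A.shape, B.shape)                        # (768, 64) and (64, 768)
        print(torch.norm(W - A @ B) / torch.norm(W))   # relative reconstruction error

Replacing a linear layer's weight W (out x in) with the factors A (out x r) and B (r x in) saves parameters whenever r < (out * in) / (out + in), which is where the compression comes from.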
Related papers
- SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models.
We propose a novel model fine-tuning method to make full use of these ineffective parameters.
Our method enhances the generative capabilities of pre-trained models in downstream applications.
arXiv Detail & Related papers (2024-09-10T16:44:47Z)
- TRAWL: Tensor Reduced and Approximated Weights for Large Language Models [11.064868044313855]
We introduce TRAWL (Tensor Reduced and Approximated Weights for Large Language Models), a technique that applies tensor decomposition across multiple weight matrices to effectively denoise LLMs by capturing global structural patterns.
Our experiments show that TRAWL improves model performance by up to 16% over baseline models on benchmark datasets, without requiring additional data, training, or fine-tuning.
arXiv Detail & Related papers (2024-06-25T04:01:32Z)
- Data-free Weight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- Efficient Compression of Overparameterized Deep Models through Low-Dimensional Learning Dynamics [10.673414267895355]
We present a novel approach for compressing overparameterized models.
Our algorithm improves the training efficiency by more than 2x, without compromising generalization.
arXiv Detail & Related papers (2023-11-08T23:57:03Z)
- Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that the gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
- Numerical Optimizations for Weighted Low-rank Estimation on Language Model [73.12941276331316]
Singular value decomposition (SVD) is one of the most popular compression methods that approximates a target matrix with smaller matrices.
Standard SVD treats the parameters within the matrix with equal importance, which is a simple but unrealistic assumption.
We show that our method can perform better than current SOTA methods in neural-based language models.
arXiv Detail & Related papers (2022-11-02T00:58:02Z)
- Multi-Dimensional Model Compression of Vision Transformer [21.8311401851523]
Vision transformers (ViT) have recently attracted considerable attention, but their huge computational cost remains an issue for practical deployment.
Previous ViT pruning methods tend to prune the model along one dimension solely.
We advocate a multi-dimensional ViT compression paradigm, and propose to harness the redundancy reduction from attention head, neuron and sequence dimensions jointly.
arXiv Detail & Related papers (2021-12-31T19:54:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.