Compressing Language Models using Doped Kronecker Products
- URL: http://arxiv.org/abs/2001.08896v5
- Date: Tue, 17 Nov 2020 05:27:00 GMT
- Title: Compressing Language Models using Doped Kronecker Products
- Authors: Urmish Thakker, Paul N. Whatmough, Zhi-Gang Liu, Matthew Mattina,
Jesse Beu
- Abstract summary: This paper proposes a way to recover accuracy otherwise lost when applying KP to large NLP tasks.
We call this compression method Doped Kronecker Product (DKP) compression.
We present experimental results demonstrating 25x compression of a 25 MB large language model with LSTM layers, with only a 1.4% loss in perplexity score.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Kronecker Products (KP) have been used to compress IoT RNN Applications by
15-38x compression factors, achieving better results than traditional
compression methods. However when KP is applied to large Natural Language
Processing tasks, it leads to significant accuracy loss (approx 26%). This
paper proposes a way to recover accuracy otherwise lost when applying KP to
large NLP tasks, by allowing additional degrees of freedom in the KP matrix.
More formally, we propose doping, a process of adding an extremely sparse
overlay matrix on top of the pre-defined KP structure. We call this compression
method doped kronecker product compression. To train these models, we present a
new solution to the phenomenon of co-matrix adaption (CMA), which uses a new
regularization scheme called co matrix dropout regularization (CMR). We present
experimental results that demonstrate compression of a large language model
with LSTM layers of size 25 MB by 25x with 1.4% loss in perplexity score. At
25x compression, an equivalent pruned network leads to 7.9% loss in perplexity
score, while HMD and LMF lead to 15% and 27% loss in perplexity score
respectively.
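The doping idea above can be sketched in a few lines: approximate a large weight matrix as the Kronecker product of two small factors, then add an extremely sparse overlay whose entries may diverge from the structure. The factor shapes and the 1% sparsity level below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small KP factors: np.kron(A, B) yields a (4*8) x (4*8) = 32x32 matrix.
A = rng.standard_normal((4, 4))
B = rng.standard_normal((8, 8))
W_kp = np.kron(A, B)

# Doping: a sparse overlay S lets a few entries diverge independently
# from the rigid KP structure. The ~1% density is an assumption.
S = np.zeros_like(W_kp)
n_nonzero = int(0.01 * S.size)  # 10 of 1024 entries
flat_idx = rng.choice(S.size, size=n_nonzero, replace=False)
S.flat[flat_idx] = rng.standard_normal(n_nonzero)

W_doped = W_kp + S

# Storage cost: KP factors plus sparse entries vs. the dense matrix.
dense_params = W_doped.size                 # 1024
doped_params = A.size + B.size + n_nonzero  # 16 + 64 + 10 = 90
print(dense_params, doped_params)           # prints "1024 90"
```

At these toy sizes the doped representation stores roughly 11x fewer parameters than the dense matrix, while the sparse overlay keeps a handful of unconstrained degrees of freedom.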
Related papers
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z) - Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z) - R2 Loss: Range Restriction Loss for Model Compression and Quantization [6.218599842159466]
We propose Range Restriction Loss (R2-Loss) for building lower bit quantization and compression friendly models by removing outliers from weights during pre-training.
R2-Loss improves lower bit quantization accuracy with state-of-the-art post-training quantization (PTQ), quantization-aware training (QAT), and model compression techniques.
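A minimal sketch of what a range-restriction penalty could look like; the squared-hinge form on |w| above a threshold is an illustrative assumption, not the paper's exact definition of R2-Loss.

```python
import numpy as np

# Sketch in the spirit of R2-Loss: penalize weight magnitudes beyond a
# threshold during training, so the weight distribution has no extreme
# outliers and a low-bit quantization grid fits the remaining range.
# The squared-hinge formulation here is an assumption for illustration.

def range_restriction_penalty(weights: np.ndarray, threshold: float) -> float:
    """Penalize only the portion of |w| that exceeds the threshold."""
    excess = np.maximum(np.abs(weights) - threshold, 0.0)
    return float(np.sum(excess ** 2))

w = np.array([0.1, -0.3, 0.2, 5.0])  # 5.0 is an outlier weight
print(range_restriction_penalty(w, threshold=1.0))  # prints "16.0"
```

In practice such a term would be added to the task loss, so gradient descent pulls outlier weights back toward the restricted range while leaving in-range weights untouched.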
arXiv Detail & Related papers (2023-03-14T21:59:21Z) - CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, and which remain stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
arXiv Detail & Related papers (2022-07-28T16:13:28Z) - Vision Transformer Compression with Structured Pruning and Low Rank
Approximation [1.9685957565449135]
The Transformer architecture has gained popularity due to its ability to scale with large datasets.
We focus on the vision transformer proposed for the image recognition task.
We explore the application of different compression techniques such as low rank approximation and pruning for this purpose.
arXiv Detail & Related papers (2022-03-25T04:18:07Z) - Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z) - Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which jointly applies channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z) - Doping: A technique for efficient compression of LSTM models using
sparse structured additive matrices [14.321761305835972]
We propose the notion of doping -- addition of an extremely sparse matrix to a structured matrix.
Doping facilitates additional degrees of freedom for a small number of parameters, allowing them to independently diverge from the fixed structure.
We show that the doped KP compression technique outperforms previous state-of-the-art compression results, achieving a 1.3-2.4x higher compression factor at similar accuracy.
arXiv Detail & Related papers (2021-02-14T05:14:09Z) - Rank and run-time aware compression of NLP Applications [12.965657113072325]
This paper proposes a new compression technique called Hybrid Matrix Factorization.
It improves low-rank matrix factorization techniques by doubling the rank of the matrix.
It can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF.
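One plausible reading of the hybrid idea can be sketched as follows; the split into a dense row block plus a low-rank block is a generic hybrid construction chosen for illustration, not necessarily the paper's exact HMF layout.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 64, 64, 8

# Plain low-rank matrix factorization (LMF): W ~= U @ V with rank r.
U = rng.standard_normal((m, r))
V = rng.standard_normal((r, n))
lmf_params = U.size + V.size  # 512 + 512 = 1024

# Hybrid split (illustrative assumption): keep k rows dense and
# factorize the remaining rows at a smaller rank, spending roughly
# the same parameter budget.
k = 8
dense_rows = rng.standard_normal((k, n))    # 8*64 = 512 params
U2 = rng.standard_normal((m - k, r // 2))   # 56*4 = 224 params
V2 = rng.standard_normal((r // 2, n))       # 4*64 = 256 params
hybrid_params = dense_rows.size + U2.size + V2.size  # 992

W_hybrid = np.vstack([dense_rows, U2 @ V2])

# For a comparable budget, the hybrid matrix can reach rank
# k + r//2 = 12 instead of r = 8.
print(lmf_params, hybrid_params, np.linalg.matrix_rank(W_hybrid))
```

The point of the comparison: for a similar parameter count, mixing an unfactorized block with a low-rank block yields a higher effective rank than a pure low-rank factorization.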
arXiv Detail & Related papers (2020-10-06T16:03:15Z) - Kernel Quantization for Efficient Network Compression [59.55192551370948]
Kernel Quantization (KQ) aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss.
Inspired by the evolution from weight pruning to filter pruning, we propose to quantize in both kernel and weight level.
Experiments on the ImageNet classification task prove that KQ needs 1.05 and 1.62 bits on average in VGG and ResNet18, respectively, to represent each parameter in the convolution layer.
arXiv Detail & Related papers (2020-03-11T08:00:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.