Compressing Language Models using Doped Kronecker Products
- URL: http://arxiv.org/abs/2001.08896v5
- Date: Tue, 17 Nov 2020 05:27:00 GMT
- Title: Compressing Language Models using Doped Kronecker Products
- Authors: Urmish Thakker, Paul N. Whatmough, Zhi-Gang Liu, Matthew Mattina,
Jesse Beu
- Abstract summary: This paper proposes a way to recover accuracy otherwise lost when applying KP to large NLP tasks.
We call this compression method Doped Kronecker Product (DKP) compression.
We present experimental results demonstrating 25x compression of a 25 MB large language model with LSTM layers, with only a 1.4% loss in perplexity score.
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Kronecker Products (KP) have been used to compress IoT RNN Applications by
15-38x compression factors, achieving better results than traditional
compression methods. However when KP is applied to large Natural Language
Processing tasks, it leads to significant accuracy loss (approx 26%). This
paper proposes a way to recover accuracy otherwise lost when applying KP to
large NLP tasks, by allowing additional degrees of freedom in the KP matrix.
More formally, we propose doping, a process of adding an extremely sparse
overlay matrix on top of the pre-defined KP structure. We call this compression
method doped kronecker product compression. To train these models, we present a
new solution to the phenomenon of co-matrix adaption (CMA), which uses a new
regularization scheme called co matrix dropout regularization (CMR). We present
experimental results that demonstrate compression of a large language model
with LSTM layers of size 25 MB by 25x with 1.4% loss in perplexity score. At
25x compression, an equivalent pruned network leads to 7.9% loss in perplexity
score, while HMD and LMF lead to 15% and 27% loss in perplexity score
respectively.
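The doping idea above can be sketched in a few lines: approximate a large weight matrix as the Kronecker product of two small factors, then add an extremely sparse overlay whose entries may diverge from the structure. The factor shapes and the 1% sparsity level below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small KP factors: np.kron(A, B) yields a (4*8) x (4*8) = 32x32 matrix.
A = rng.standard_normal((4, 4))
B = rng.standard_normal((8, 8))
W_kp = np.kron(A, B)

# Doping: a sparse overlay S lets a few entries diverge independently
# from the rigid KP structure. The ~1% density is an assumption.
S = np.zeros_like(W_kp)
n_nonzero = int(0.01 * S.size)  # 10 of 1024 entries
flat_idx = rng.choice(S.size, size=n_nonzero, replace=False)
S.flat[flat_idx] = rng.standard_normal(n_nonzero)

W_doped = W_kp + S

# Storage cost: KP factors plus sparse entries vs. the dense matrix.
dense_params = W_doped.size                 # 1024
doped_params = A.size + B.size + n_nonzero  # 16 + 64 + 10 = 90
print(dense_params, doped_params)           # prints "1024 90"
```

At these toy sizes the doped representation stores roughly 11x fewer parameters than the dense matrix, while the sparse overlay keeps a handful of unconstrained degrees of freedom.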
Related papers
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z) - Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z) - R2 Loss: Range Restriction Loss for Model Compression and Quantization [6.218599842159466]
We propose Range Restriction Loss (R2-Loss) for building lower bit quantization and compression friendly models by removing outliers from weights during pre-training.
R2-Loss improves lower bit quantization accuracy with state-of-the-art post-training quantization (PTQ), quantization-aware training (QAT), and model compression techniques.
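A minimal sketch of what a range-restriction penalty could look like; the squared-hinge form on |w| above a threshold is an illustrative assumption, not the paper's exact definition of R2-Loss.

```python
import numpy as np

# Sketch in the spirit of R2-Loss: penalize weight magnitudes beyond a
# threshold during training, so the weight distribution has no extreme
# outliers and a low-bit quantization grid fits the remaining range.
# The squared-hinge formulation here is an assumption for illustration.

def range_restriction_penalty(weights: np.ndarray, threshold: float) -> float:
    """Penalize only the portion of |w| that exceeds the threshold."""
    excess = np.maximum(np.abs(weights) - threshold, 0.0)
    return float(np.sum(excess ** 2))

w = np.array([0.1, -0.3, 0.2, 5.0])  # 5.0 is an outlier weight
print(range_restriction_penalty(w, threshold=1.0))  # prints "16.0"
```

In practice such a term would be added to the task loss, so gradient descent pulls outlier weights back toward the restricted range while leaving in-range weights untouched.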
arXiv Detail & Related papers (2023-03-14T21:59:21Z) - CrAM: A Compression-Aware Minimizer [103.29159003723815]
We propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way.
CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, and which remain stable under weight pruning.
CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware.
arXiv Detail & Related papers (2022-07-28T16:13:28Z) - Vision Transformer Compression with Structured Pruning and Low Rank
Approximation [1.9685957565449135]
The Transformer architecture has gained popularity due to its ability to scale with large datasets.
We focus on the vision transformer proposed for the image recognition task.
We explore the application of different compression techniques such as low rank approximation and pruning for this purpose.
arXiv Detail & Related papers (2022-03-25T04:18:07Z) - Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z) - Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which jointly applies channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z) - Doping: A technique for efficient compression of LSTM models using
sparse structured additive matrices [14.321761305835972]
We propose the notion of doping -- addition of an extremely sparse matrix to a structured matrix.
Doping facilitates additional degrees of freedom for a small number of parameters, allowing them to independently diverge from the fixed structure.
We show that the doped KP compression technique outperforms previous state-of-the-art compression results, achieving a 1.3-2.4x higher compression factor at similar accuracy.
arXiv Detail & Related papers (2021-02-14T05:14:09Z) - Rank and run-time aware compression of NLP Applications [12.965657113072325]
This paper proposes a new compression technique called Hybrid Matrix Factorization.
It improves low-rank matrix factorization techniques by doubling the rank of the matrix.
It can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF.
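One plausible reading of the hybrid idea can be sketched as follows; the split into a dense row block plus a low-rank block is a generic hybrid construction chosen for illustration, not necessarily the paper's exact HMF layout.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 64, 64, 8

# Plain low-rank matrix factorization (LMF): W ~= U @ V with rank r.
U = rng.standard_normal((m, r))
V = rng.standard_normal((r, n))
lmf_params = U.size + V.size  # 512 + 512 = 1024

# Hybrid split (illustrative assumption): keep k rows dense and
# factorize the remaining rows at a smaller rank, spending roughly
# the same parameter budget.
k = 8
dense_rows = rng.standard_normal((k, n))    # 8*64 = 512 params
U2 = rng.standard_normal((m - k, r // 2))   # 56*4 = 224 params
V2 = rng.standard_normal((r // 2, n))       # 4*64 = 256 params
hybrid_params = dense_rows.size + U2.size + V2.size  # 992

W_hybrid = np.vstack([dense_rows, U2 @ V2])

# For a comparable budget, the hybrid matrix can reach rank
# k + r//2 = 12 instead of r = 8.
print(lmf_params, hybrid_params, np.linalg.matrix_rank(W_hybrid))
```

The point of the comparison: for a similar parameter count, mixing an unfactorized block with a low-rank block yields a higher effective rank than a pure low-rank factorization.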
arXiv Detail & Related papers (2020-10-06T16:03:15Z) - Kernel Quantization for Efficient Network Compression [59.55192551370948]
Kernel Quantization (KQ) aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss.
Inspired by the evolution from weight pruning to filter pruning, we propose to quantize in both kernel and weight level.
Experiments on the ImageNet classification task prove that KQ needs 1.05 and 1.62 bits on average in VGG and ResNet18, respectively, to represent each parameter in the convolution layer.
arXiv Detail & Related papers (2020-03-11T08:00:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.