DSFormer: Effective Compression of Text-Transformers by Dense-Sparse
Weight Factorization
- URL: http://arxiv.org/abs/2312.13211v1
- Date: Wed, 20 Dec 2023 17:27:25 GMT
- Title: DSFormer: Effective Compression of Text-Transformers by Dense-Sparse
Weight Factorization
- Authors: Rahul Chand, Yashoteja Prabhu, Pratyush Kumar
- Abstract summary: DSFormer is a simple alternative factorization scheme which expresses a target weight matrix as the product of a small dense and a semi-structured sparse matrix.
Our approach is also orthogonal to mainstream compressors and offers up to 50% additional compression when added to popular distilled, layer-shared and quantized transformers.
- Score: 12.277820111814691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the tremendous success of large transformer models in natural language
understanding, down-sizing them for cost-effective deployments has become
critical. Recent studies have explored the low-rank weight factorization
techniques which are efficient to train, and apply out-of-the-box to any
transformer architecture. Unfortunately, the low-rank assumption tends to be
over-restrictive and hinders the expressiveness of the compressed model. This
paper proposes DSFormer, a simple alternative factorization scheme which
expresses a target weight matrix as the product of a small dense and a
semi-structured sparse matrix. The resulting approximation is more faithful to
the weight distribution in transformers and therefore achieves a stronger
efficiency-accuracy trade-off. Another concern with existing factorizers is
their dependence on a task-unaware initialization step which degrades the
accuracy of the resulting model. DSFormer addresses this issue through a novel
Straight-Through Factorizer (STF) algorithm that jointly learns all the weight
factorizations to directly maximize the final task accuracy. Extensive
experiments on multiple natural language understanding benchmarks demonstrate
that DSFormer obtains up to 40% better compression than the state-of-the-art
low-rank factorizers, leading semi-structured sparsity baselines and popular
knowledge distillation approaches. Our approach is also orthogonal to
mainstream compressors and offers up to 50% additional compression when added
to popular distilled, layer-shared and quantized transformers. We empirically
evaluate the benefits of STF over conventional optimization practices.
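The abstract describes the factorization only at a high level. As a rough illustration, the sketch below (a hypothetical `DenseSparseLinear` module, not the authors' code) approximates a weight matrix as the product of a small dense factor and a 2:4 semi-structured sparse factor, and uses a straight-through estimator so both factors can be trained directly against the task loss, in the spirit of the Straight-Through Factorizer; the dimensions, sparsity pattern and initialization are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): approximate a weight W as D @ S,
# where D is a small dense matrix and S carries a 2:4 semi-structured sparsity
# pattern. A straight-through estimator (STE) lets gradients pass through the
# hard top-k selection so both factors are learned against the task loss.
import torch
import torch.nn as nn


def two_four_mask(w: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude entries in every group of 4 along the last dim."""
    d_out, d_in = w.shape
    groups = w.abs().reshape(d_out, d_in // 4, 4)
    idx = groups.topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return mask.reshape(d_out, d_in)


class DenseSparseLinear(nn.Module):
    """y = x @ (D @ S)^T + b, with S sparsified on the fly via an STE."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.dense = nn.Parameter(torch.randn(d_out, rank) * 0.02)   # small dense factor D
        self.sparse = nn.Parameter(torch.randn(rank, d_in) * 0.02)   # densely stored proxy for S
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = two_four_mask(self.sparse)
        # Straight-through: forward uses the masked (sparse) weights,
        # backward treats the masking as identity.
        s = self.sparse + (self.sparse * mask - self.sparse).detach()
        w = self.dense @ s
        return x @ w.t() + self.bias


layer = DenseSparseLinear(d_in=768, d_out=768, rank=192)
y = layer(torch.randn(4, 768))   # toy forward pass
y.sum().backward()               # gradients reach both factors through the STE
print(layer.sparse.grad.shape)
```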
Related papers
- Language Models as Zero-shot Lossless Gradient Compressors: Towards
General Neural Parameter Prior Models [66.1595537904019]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
- Convolutional Neural Network Compression Based on Low-Rank Decomposition [3.3295360710329738]
This paper proposes a model compression method that integrates Variational Bayesian Matrix Factorization.
VBMF is employed to estimate the rank of the weight tensor at each layer.
Experimental results show that the compressed model achieves strong performance at both high and low compression ratios.
arXiv Detail & Related papers (2024-08-29T06:40:34Z)
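As a rough illustration of low-rank weight factorization, the sketch below uses a truncated SVD with a hand-picked rank; the paper's VBMF-based per-layer rank estimation is not reproduced here, and the shapes and rank are illustrative assumptions.

```python
# Illustrative sketch of low-rank weight factorization (not the paper's code):
# the paper estimates the rank per layer with VBMF; a fixed rank stands in here.
import torch


def low_rank_factorize(w: torch.Tensor, rank: int):
    """Split a (d_out, d_in) weight into two thin factors via truncated SVD."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (d_out, rank), singular values folded in
    b = vh[:rank, :]             # (rank, d_in)
    return a, b


w = torch.randn(768, 3072)
a, b = low_rank_factorize(w, rank=64)
rel_error = torch.linalg.norm(w - a @ b) / torch.linalg.norm(w)
compression = w.numel() / (a.numel() + b.numel())
print(f"relative error {rel_error:.3f}, compression {compression:.1f}x")
```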
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient training and inference algorithms such as low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z)
- Low-Rank Prune-And-Factorize for Language Model Compression [18.088550230146247]
Matrix factorization fails to retain satisfactory performance under moderate to high compression rates.
We propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning.
arXiv Detail & Related papers (2023-06-25T07:38:43Z)
- Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding [12.030179065286928]
The paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and runtime latency of transformer-based models.
A layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer.
The performance is demonstrated in two natural language understanding tasks, showing up to $63\times$ compression ratio, little accuracy loss and remarkable inference and training speedup.
arXiv Detail & Related papers (2023-06-01T18:32:08Z)
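The sketch below illustrates only the quantization-aware part of such training (fake int8 quantization with a straight-through estimator); it is not the paper's tensor-compressed implementation, and the `QATLinear`/`fake_quant` names and bit-width are illustrative assumptions.

```python
# Generic sketch of quantization-aware training (not the paper's tensor-train
# implementation): weights are fake-quantized to int8 levels in the forward
# pass, and a straight-through estimator passes gradients to the fp32 weights.
import torch
import torch.nn as nn


def fake_quant(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (q - w).detach()   # STE: forward = quantized, backward = identity


class QATLinear(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, fake_quant(self.weight), self.bias)


layer = QATLinear(768, 768)
out = layer(torch.randn(2, 768))
out.sum().backward()              # gradients reach the fp32 master weights
```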
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding [20.75335227098455]
Large pre-trained Transformer networks have demonstrated dramatic improvements in many natural language understanding tasks.
New hardware supporting both N:M semi-structured sparsity and low-precision integer computation is a promising solution to boost model serving efficiency.
We propose a flexible compression framework NxMiFormer that performs simultaneous sparsification and quantization.
arXiv Detail & Related papers (2022-06-30T04:33:50Z)
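As a rough sketch of the N:M idea (not NxMiFormer's actual framework), the snippet below keeps the 2 largest-magnitude weights in every group of 4 and stores the survivors as int8; the helper names and shapes are illustrative assumptions.

```python
# Minimal sketch of N:M semi-structured sparsity plus low-bit storage
# (not the paper's code): keep the N largest-magnitude weights in every
# group of M (here 2 of 4), then quantize the survivors to int8.
import torch


def nm_prune(w: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    d_out, d_in = w.shape
    groups = w.reshape(d_out, d_in // m, m)
    idx = groups.abs().topk(n, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return (groups * mask).reshape(d_out, d_in)


def int8_quant(w: torch.Tensor):
    scale = w.abs().max().clamp(min=1e-8) / 127
    return torch.round(w / scale).to(torch.int8), scale


w = torch.randn(768, 768)
sparse_w = nm_prune(w)                        # 50% structured sparsity
q, scale = int8_quant(sparse_w)               # low-bit storage of survivors
print((sparse_w == 0).float().mean().item())  # ~0.5
```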
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
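As a loose illustration of a ranking cascade (the actual Cascade Transformer prunes candidates at intermediate encoder layers of a single model), the sketch below filters candidates with a cheap scorer before re-ranking the survivors with an expensive one; `cheap_score`, `expensive_score` and the toy scorers are hypothetical stand-ins.

```python
# Simplified cascade-of-rankers sketch (not the paper's implementation):
# a cheap scorer discards most candidates, an expensive scorer ranks the rest.
from typing import Callable, List, Tuple


def cascade_rank(
    question: str,
    candidates: List[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    keep: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap model filters the candidate pool.
    shortlist = sorted(candidates, key=lambda c: cheap_score(question, c), reverse=True)[:keep]
    # Stage 2: expensive model scores only the survivors.
    scored = [(c, expensive_score(question, c)) for c in shortlist]
    return sorted(scored, key=lambda x: x[1], reverse=True)


# Toy usage with trivial scorers (word overlap vs. length-normalized overlap).
overlap = lambda q, c: len(set(q.split()) & set(c.split()))
ranked = cascade_rank(
    "who wrote hamlet",
    ["Shakespeare wrote Hamlet.", "Hamlet is a play."],
    cheap_score=overlap,
    expensive_score=lambda q, c: overlap(q, c) / (1 + len(c.split())),
)
print(ranked[0][0])
```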
This list is automatically generated from the titles and abstracts of the papers in this site.