DSFormer: Effective Compression of Text-Transformers by Dense-Sparse
Weight Factorization
- URL: http://arxiv.org/abs/2312.13211v1
- Date: Wed, 20 Dec 2023 17:27:25 GMT
- Title: DSFormer: Effective Compression of Text-Transformers by Dense-Sparse
Weight Factorization
- Authors: Rahul Chand, Yashoteja Prabhu, Pratyush Kumar
- Abstract summary: DSFormer is a simple alternative factorization scheme which expresses a target weight matrix as the product of a small dense and a semi-structured sparse matrix.
Our approach is also orthogonal to mainstream compressors and offers up to 50% additional compression when added to popular distilled, layer-shared and quantized transformers.
- Score: 12.277820111814691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the tremendous success of large transformer models in natural language
understanding, down-sizing them for cost-effective deployments has become
critical. Recent studies have explored the low-rank weight factorization
techniques which are efficient to train, and apply out-of-the-box to any
transformer architecture. Unfortunately, the low-rank assumption tends to be
over-restrictive and hinders the expressiveness of the compressed model. This
paper proposes DSFormer, a simple alternative factorization scheme which
expresses a target weight matrix as the product of a small dense and a
semi-structured sparse matrix. The resulting approximation is more faithful to
the weight distribution in transformers and therefore achieves a stronger
efficiency-accuracy trade-off. Another concern with existing factorizers is
their dependence on a task-unaware initialization step which degrades the
accuracy of the resulting model. DSFormer addresses this issue through a novel
Straight-Through Factorizer (STF) algorithm that jointly learns all the weight
factorizations to directly maximize the final task accuracy. Extensive
experiments on multiple natural language understanding benchmarks demonstrate
that DSFormer obtains up to 40% better compression than the state-of-the-art
low-rank factorizers, leading semi-structured sparsity baselines and popular
knowledge distillation approaches. Our approach is also orthogonal to
mainstream compressors and offers up to 50% additional compression when added
to popular distilled, layer-shared and quantized transformers. We empirically
evaluate the benefits of STF over conventional optimization practices.
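For intuition, here is a minimal PyTorch sketch of the idea: a linear layer whose weight is reconstructed as the product of a small dense factor and a 2:4 semi-structured sparse factor, with the hard sparsity mask applied through a straight-through estimator so that both factors are trained jointly against the task loss, in the spirit of STF. The module and function names, the 2:4 pattern, the inner dimension and the initialization are assumptions of this sketch, not the authors' released implementation.

```python
# Illustrative sketch only (assumed names, 2:4 pattern, inner_dim): a linear layer
# whose weight W (out x in) is approximated as D @ S, where D is a small dense
# factor and S is kept 2:4 semi-structured sparse via a hard mask.
import torch
import torch.nn as nn
import torch.nn.functional as F


def nm_mask(w: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary mask keeping the n largest-magnitude entries in every group of m
    along the last dimension (the usual N:M semi-structured pattern)."""
    rows, cols = w.shape
    groups = w.detach().abs().reshape(rows, cols // m, m)
    idx = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return mask.reshape(rows, cols)


class DenseSparseLinear(nn.Module):
    """Dense-sparse factorized replacement for nn.Linear.

    The mask is applied with a straight-through estimator so the dense and
    sparse factors are learned jointly from the task loss (the spirit of the
    paper's Straight-Through Factorizer, not its exact algorithm).
    """

    def __init__(self, in_features: int, out_features: int, inner_dim: int):
        super().__init__()
        assert in_features % 4 == 0, "in_features must be divisible by the group size"
        self.dense = nn.Parameter(torch.randn(out_features, inner_dim) * 0.02)   # D: small dense factor
        self.sparse = nn.Parameter(torch.randn(inner_dim, in_features) * 0.02)   # S: masked to 2:4 sparsity
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = nm_mask(self.sparse)
        # Straight-through: the forward pass uses the masked factor, while the
        # backward pass treats masking as identity so gradients reach every entry.
        s = self.sparse + (self.sparse * mask - self.sparse).detach()
        w = self.dense @ s                    # reconstructed (out_features x in_features) weight
        return F.linear(x, w, self.bias)


# Example: replacing a 768x768 projection with inner dimension 192.
layer = DenseSparseLinear(in_features=768, out_features=768, inner_dim=192)
y = layer(torch.randn(8, 768))               # trained end-to-end with the usual task loss
```

Under these assumed sizes, the stored parameters are roughly 768*192 dense values plus half of 192*768 sparse values (ignoring index overhead), about 2.7x fewer than the original 768*768 matrix, which illustrates how the dense-sparse split trades rank against sparsity.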
Related papers
- Adaptive Pruning of Pretrained Transformer via Differential Inclusions [48.47890215458465]
Current compression algorithms prune transformers at fixed compression ratios, requiring a unique pruning process for each ratio.
We propose pruning of pretrained transformers at any desired ratio within a single pruning stage, based on a differential inclusion for a mask parameter.
This dynamic can generate the whole regularization solution path of the mask parameter, whose support set identifies the network structure.
arXiv Detail & Related papers (2025-01-06T06:34:52Z) - SEE: Sememe Entanglement Encoding for Transformer-bases Models Compression [20.824040486029354]
Transformer-based large language models exhibit groundbreaking capabilities, but their storage and computational costs are high, limiting their application in resource-constrained scenarios.
An effective approach is to eliminate redundant model parameters and computational costs while incorporating efficient expert-derived knowledge structures to achieve a balance between compression and performance.
arXiv Detail & Related papers (2024-12-15T12:01:43Z) - BEExformer: A Fast Inferencing Transformer Architecture via Binarization with Multiple Early Exits [2.7651063843287718]
Large Language Models (LLMs) based on transformers achieve cutting-edge results on a variety of applications.
Among various efficiency considerations, model binarization and Early Exit (EE) are commonly used, effective solutions.
We propose Binarized Early Exit Transformer (BEExformer), the first-ever selective learning transformer architecture.
arXiv Detail & Related papers (2024-12-06T17:58:14Z) - Convolutional Neural Network Compression Based on Low-Rank Decomposition [3.3295360710329738]
This paper proposes a model compression method that integrates Variational Bayesian Matrix Factorization (VBMF).
VBMF is employed to estimate the rank of the weight tensor at each layer.
Experimental results show that, for both high and low compression ratios, the compressed model performs well.
arXiv Detail & Related papers (2024-08-29T06:40:34Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient training and inference algorithms such as low-rank computation have impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - A Survey on Transformer Compression [84.18094368700379]
Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Quantization-Aware and Tensor-Compressed Training of Transformers for
Natural Language Understanding [12.030179065286928]
The paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and runtime latency of transformer-based models.
A layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer.
The performance is demonstrated on two natural language understanding tasks, showing up to $63\times$ compression ratio, little accuracy loss, and remarkable inference and training speedups.
arXiv Detail & Related papers (2023-06-01T18:32:08Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer
Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - The Cascade Transformer: an Application for Efficient Answer Sentence
Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)