DSFormer: Effective Compression of Text-Transformers by Dense-Sparse
Weight Factorization
- URL: http://arxiv.org/abs/2312.13211v1
- Date: Wed, 20 Dec 2023 17:27:25 GMT
- Title: DSFormer: Effective Compression of Text-Transformers by Dense-Sparse
Weight Factorization
- Authors: Rahul Chand, Yashoteja Prabhu, Pratyush Kumar
- Abstract summary: DSFormer is a simple alternative factorization scheme which expresses a target weight matrix as the product of a small dense and a semi-structured sparse matrix.
Our approach is also orthogonal to mainstream compressors and offers up to 50% additional compression when added to popular distilled, layer-shared and quantized transformers.
- Score: 12.277820111814691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the tremendous success of large transformer models in natural language
understanding, down-sizing them for cost-effective deployments has become
critical. Recent studies have explored the low-rank weight factorization
techniques which are efficient to train, and apply out-of-the-box to any
transformer architecture. Unfortunately, the low-rank assumption tends to be
over-restrictive and hinders the expressiveness of the compressed model. This
paper proposes DSFormer, a simple alternative factorization scheme which
expresses a target weight matrix as the product of a small dense and a
semi-structured sparse matrix. The resulting approximation is more faithful to
the weight distribution in transformers and therefore achieves a stronger
efficiency-accuracy trade-off. Another concern with existing factorizers is
their dependence on a task-unaware initialization step which degrades the
accuracy of the resulting model. DSFormer addresses this issue through a novel
Straight-Through Factorizer (STF) algorithm that jointly learns all the weight
factorizations to directly maximize the final task accuracy. Extensive
experiments on multiple natural language understanding benchmarks demonstrate
that DSFormer obtains up to 40% better compression than the state-of-the-art
low-rank factorizers, leading semi-structured sparsity baselines and popular
knowledge distillation approaches. Our approach is also orthogonal to
mainstream compressors and offers up to 50% additional compression when added
to popular distilled, layer-shared and quantized transformers. We empirically
evaluate the benefits of STF over conventional optimization practices.
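The abstract describes the factorization only at a high level. As a rough illustration, the sketch below (a hypothetical `DenseSparseLinear` module, not the authors' code) approximates a weight matrix as the product of a small dense factor and a 2:4 semi-structured sparse factor, and uses a straight-through estimator so both factors can be trained directly against the task loss, in the spirit of the Straight-Through Factorizer; the dimensions, sparsity pattern and initialization are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): approximate a weight W as D @ S,
# where D is a small dense matrix and S carries a 2:4 semi-structured sparsity
# pattern. A straight-through estimator (STE) lets gradients pass through the
# hard top-k selection so both factors are learned against the task loss.
import torch
import torch.nn as nn


def two_four_mask(w: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude entries in every group of 4 along the last dim."""
    d_out, d_in = w.shape
    groups = w.abs().reshape(d_out, d_in // 4, 4)
    idx = groups.topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return mask.reshape(d_out, d_in)


class DenseSparseLinear(nn.Module):
    """y = x @ (D @ S)^T + b, with S sparsified on the fly via an STE."""

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.dense = nn.Parameter(torch.randn(d_out, rank) * 0.02)   # small dense factor D
        self.sparse = nn.Parameter(torch.randn(rank, d_in) * 0.02)   # densely stored proxy for S
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = two_four_mask(self.sparse)
        # Straight-through: forward uses the masked (sparse) weights,
        # backward treats the masking as identity.
        s = self.sparse + (self.sparse * mask - self.sparse).detach()
        w = self.dense @ s
        return x @ w.t() + self.bias


layer = DenseSparseLinear(d_in=768, d_out=768, rank=192)
y = layer(torch.randn(4, 768))   # toy forward pass
y.sum().backward()               # gradients reach both factors through the STE
print(layer.sparse.grad.shape)
```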
Related papers
- Language Models as Zero-shot Lossless Gradient Compressors: Towards
General Neural Parameter Prior Models [66.1595537904019]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
- Convolutional Neural Network Compression Based on Low-Rank Decomposition [3.3295360710329738]
This paper proposes a model compression method that integrates Variational Bayesian Matrix Factorization.
VBMF is employed to estimate the rank of the weight tensor at each layer.
Experimental results show that the compressed model achieves strong performance at both high and low compression ratios.
arXiv Detail & Related papers (2024-08-29T06:40:34Z)
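As a rough illustration of low-rank weight factorization, the sketch below uses a truncated SVD with a hand-picked rank; the paper's VBMF-based per-layer rank estimation is not reproduced here, and the shapes and rank are illustrative assumptions.

```python
# Illustrative sketch of low-rank weight factorization (not the paper's code):
# the paper estimates the rank per layer with VBMF; a fixed rank stands in here.
import torch


def low_rank_factorize(w: torch.Tensor, rank: int):
    """Split a (d_out, d_in) weight into two thin factors via truncated SVD."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (d_out, rank), singular values folded in
    b = vh[:rank, :]             # (rank, d_in)
    return a, b


w = torch.randn(768, 3072)
a, b = low_rank_factorize(w, rank=64)
rel_error = torch.linalg.norm(w - a @ b) / torch.linalg.norm(w)
compression = w.numel() / (a.numel() + b.numel())
print(f"relative error {rel_error:.3f}, compression {compression:.1f}x")
```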
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient training and inference algorithms such as low-rank computation achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
- A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z)
- Low-Rank Prune-And-Factorize for Language Model Compression [18.088550230146247]
Matrix factorization fails to retain satisfactory performance under moderate to high compression rates.
We propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning.
arXiv Detail & Related papers (2023-06-25T07:38:43Z)
- Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding [12.030179065286928]
The paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and runtime latency of transformer-based models.
A layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer.
The performance is demonstrated in two natural language understanding tasks, showing up to $63\times$ compression ratio, little accuracy loss and remarkable inference and training speedup.
arXiv Detail & Related papers (2023-06-01T18:32:08Z)
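The sketch below illustrates only the quantization-aware part of such training (fake int8 quantization with a straight-through estimator); it is not the paper's tensor-compressed implementation, and the `QATLinear`/`fake_quant` names and bit-width are illustrative assumptions.

```python
# Generic sketch of quantization-aware training (not the paper's tensor-train
# implementation): weights are fake-quantized to int8 levels in the forward
# pass, and a straight-through estimator passes gradients to the fp32 weights.
import torch
import torch.nn as nn


def fake_quant(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (q - w).detach()   # STE: forward = quantized, backward = identity


class QATLinear(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, fake_quant(self.weight), self.bias)


layer = QATLinear(768, 768)
out = layer(torch.randn(2, 768))
out.sum().backward()              # gradients reach the fp32 master weights
```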
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- Compressing Pre-trained Transformers via Low-Bit NxM Sparsity for Natural Language Understanding [20.75335227098455]
Large pre-trained Transformer networks have demonstrated dramatic improvements in many natural language understanding tasks.
New hardware supporting both N:M semi-structured sparsity and low-precision integer computation is a promising solution to boost model serving efficiency.
We propose a flexible compression framework NxMiFormer that performs simultaneous sparsification and quantization.
arXiv Detail & Related papers (2022-06-30T04:33:50Z)
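As a rough sketch of the N:M idea (not NxMiFormer's actual framework), the snippet below keeps the 2 largest-magnitude weights in every group of 4 and stores the survivors as int8; the helper names and shapes are illustrative assumptions.

```python
# Minimal sketch of N:M semi-structured sparsity plus low-bit storage
# (not the paper's code): keep the N largest-magnitude weights in every
# group of M (here 2 of 4), then quantize the survivors to int8.
import torch


def nm_prune(w: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    d_out, d_in = w.shape
    groups = w.reshape(d_out, d_in // m, m)
    idx = groups.abs().topk(n, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return (groups * mask).reshape(d_out, d_in)


def int8_quant(w: torch.Tensor):
    scale = w.abs().max().clamp(min=1e-8) / 127
    return torch.round(w / scale).to(torch.int8), scale


w = torch.randn(768, 768)
sparse_w = nm_prune(w)                        # 50% structured sparsity
q, scale = int8_quant(sparse_w)               # low-bit storage of survivors
print((sparse_w == 0).float().mean().item())  # ~0.5
```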
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
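As a loose illustration of a ranking cascade (the actual Cascade Transformer prunes candidates at intermediate encoder layers of a single model), the sketch below filters candidates with a cheap scorer before re-ranking the survivors with an expensive one; `cheap_score`, `expensive_score` and the toy scorers are hypothetical stand-ins.

```python
# Simplified cascade-of-rankers sketch (not the paper's implementation):
# a cheap scorer discards most candidates, an expensive scorer ranks the rest.
from typing import Callable, List, Tuple


def cascade_rank(
    question: str,
    candidates: List[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    keep: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap model filters the candidate pool.
    shortlist = sorted(candidates, key=lambda c: cheap_score(question, c), reverse=True)[:keep]
    # Stage 2: expensive model scores only the survivors.
    scored = [(c, expensive_score(question, c)) for c in shortlist]
    return sorted(scored, key=lambda x: x[1], reverse=True)


# Toy usage with trivial scorers (word overlap vs. length-normalized overlap).
overlap = lambda q, c: len(set(q.split()) & set(c.split()))
ranked = cascade_rank(
    "who wrote hamlet",
    ["Shakespeare wrote Hamlet.", "Hamlet is a play."],
    cheap_score=overlap,
    expensive_score=lambda q, c: overlap(q, c) / (1 + len(c.split())),
)
print(ranked[0][0])
```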
This list is automatically generated from the titles and abstracts of the papers in this site.