STAT: Shrinking Transformers After Training
- URL: http://arxiv.org/abs/2406.00061v1
- Date: Wed, 29 May 2024 22:59:11 GMT
- Title: STAT: Shrinking Transformers After Training
- Authors: Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle
- Abstract summary: We present STAT, a simple algorithm to prune transformer models without any fine-tuning.
STAT eliminates both attention heads and neurons from the network, while preserving accuracy by calculating a correction to the weights of the next layer.
Our entire algorithm takes minutes to compress BERT, and less than three hours to compress models with 7B parameters using a single GPU.
- Score: 72.0726371426711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present STAT: a simple algorithm to prune transformer models without any fine-tuning. STAT eliminates both attention heads and neurons from the network, while preserving accuracy by calculating a correction to the weights of the next layer. Each layer block in the network is compressed using a series of principled matrix factorizations that preserve the network structure. Our entire algorithm takes minutes to compress BERT, and less than three hours to compress models with 7B parameters using a single GPU. Using only several hundred data examples, STAT preserves the output of the network and improves upon existing gradient-free pruning methods. It is even competitive with methods that include significant fine-tuning. We demonstrate our method on both encoder and decoder architectures, including BERT, DistilBERT, and Llama-2 using benchmarks such as GLUE, Squad, WikiText2.
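The abstract describes the key mechanism, pruning neurons or heads and then correcting the next layer's weights from a few hundred calibration examples, without giving the factorization itself. The minimal NumPy sketch below illustrates that idea for one feed-forward block using an activation-norm selection rule and a least-squares correction; both choices, and the name prune_ffn_with_correction, are assumptions for illustration rather than STAT's actual factorization.

```python
import numpy as np

def prune_ffn_with_correction(W1, b1, W2, X, keep_ratio=0.5):
    """Illustrative sketch: drop FFN neurons and correct the next layer.

    W1: (d, h) first linear layer, b1: (h,), W2: (h, d) second linear layer.
    X:  (n, d) calibration inputs (a few hundred examples, as in the abstract).
    The neuron-selection rule and least-squares correction below are
    assumptions, not STAT's principled matrix factorization.
    """
    H = np.maximum(X @ W1 + b1, 0.0)                    # hidden activations, (n, h)
    k = max(1, int(keep_ratio * H.shape[1]))
    keep = np.argsort(-np.linalg.norm(H, axis=0))[:k]   # keep the most active neurons

    # Correction: choose new second-layer weights so the pruned block
    # reproduces the original block's output on the calibration data:
    #   H[:, keep] @ W2_new  ≈  H @ W2   (least squares)
    target = H @ W2
    W2_new, *_ = np.linalg.lstsq(H[:, keep], target, rcond=None)

    return W1[:, keep], b1[keep], W2_new, keep
```

With keep_ratio=0.5 the block's hidden width is halved while its output on the calibration inputs is matched in a least-squares sense, which is the role the next-layer weight correction plays in STAT.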
Related papers
- MST-compression: Compressing and Accelerating Binary Neural Networks with Minimum Spanning Tree [21.15961593182111]
Binary neural networks (BNNs) have been widely adopted to reduce the computational cost and memory storage on edge-computing devices.
However, as neural networks grow wider and deeper to improve accuracy and meet practical requirements, the computational burden remains a significant challenge even for their binary versions.
This paper proposes a novel method called Minimum Spanning Tree (MST) compression that learns to compress and accelerate BNNs.
arXiv Detail & Related papers (2023-08-26T02:42:12Z)
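The summary does not spell out how the spanning tree drives compression; a natural reading is that similar binary kernels are linked so one can be computed as a small delta from another. The sketch below only builds such a tree over Hamming distances with SciPy; the distance-based cost and the reuse scheme are my assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def kernel_mst(binary_kernels):
    """binary_kernels: (num_kernels, k*k*c) array with entries in {-1, +1}.

    Builds a minimum spanning tree over pairwise Hamming distances, so each
    kernel could in principle be computed as a cheap delta from its parent.
    Using Hamming distance as the edge cost is an illustrative assumption.
    """
    K = np.asarray(binary_kernels)
    # Hamming distance between +-1 vectors: number of differing positions.
    # (SciPy treats a 0 entry as "no edge", so identical kernels end up
    # unconnected; acceptable for this sketch.)
    diff = (K[:, None, :] != K[None, :, :]).sum(-1).astype(float)
    mst = minimum_spanning_tree(diff)        # sparse matrix of tree edges
    edges = np.transpose(mst.nonzero())      # (parent, child) pairs
    return edges, mst.sum()                  # tree edges and total delta cost
```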
- Neural Network Compression using Binarization and Few Full-Precision Weights [7.206962876422061]
Automatic Prune Binarization (APB) is a novel compression technique combining quantization with pruning.
APB enhances the representational capability of binary networks using a few full-precision weights.
APB delivers better accuracy/memory trade-off compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-06-15T08:52:00Z)
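The summary says APB keeps a few full-precision weights alongside a binary network but not how they are chosen. As a hedged illustration, the sketch below uses a simple magnitude rule: the largest-magnitude fraction stays full precision and everything else is binarized to alpha * sign(w). The selection rule and the name binarize_with_outliers are assumptions, not APB's criterion.

```python
import numpy as np

def binarize_with_outliers(W, full_precision_frac=0.01):
    """Keep the largest-magnitude fraction of weights in full precision and
    binarize the remainder to alpha * sign(w). Illustrative APB-style split;
    the magnitude heuristic is an assumption, not the paper's rule."""
    W = np.asarray(W, dtype=np.float32)
    k = max(1, int(full_precision_frac * W.size))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    outlier_mask = np.abs(W) >= thresh                  # few full-precision weights
    rest = W[~outlier_mask]
    alpha = np.abs(rest).mean() if rest.size else 0.0   # scale for the binary part
    W_hat = np.where(outlier_mask, W, alpha * np.sign(W))
    return W_hat, outlier_mask
```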
- Block-wise Bit-Compression of Transformer-based Models [9.77519365079468]
We propose BBCT, a method of block-wise bit-compression for transformers that requires no retraining.
Our benchmark test results on General Language Understanding Evaluation (GLUE) show that BBCT can achieve less than 1% accuracy drop in most tasks.
arXiv Detail & Related papers (2023-03-16T09:53:57Z)
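As a rough picture of block-wise bit-compression without retraining, the sketch below gives every tile of a weight matrix its own max-abs scale and rounds it to a low bit-width. The block size, bit-width, and symmetric rounding are illustrative assumptions rather than BBCT's actual scheme.

```python
import numpy as np

def blockwise_quantize(W, block=64, bits=4):
    """Quantize each (block x block) tile of W with its own max-abs scale.
    Symmetric rounding to `bits` bits; purely illustrative of block-wise
    bit-compression, not BBCT's exact scheme."""
    W = np.asarray(W, dtype=np.float32)
    qmax = 2 ** (bits - 1) - 1
    out = np.zeros_like(W)
    for i in range(0, W.shape[0], block):
        for j in range(0, W.shape[1], block):
            tile = W[i:i + block, j:j + block]
            scale = max(np.abs(tile).max() / qmax, 1e-12)   # avoid division by zero
            q = np.clip(np.round(tile / scale), -qmax, qmax)
            out[i:i + block, j:j + block] = q * scale        # dequantized tile
    return out
```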
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
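Monarch matrices are built from block-diagonal factors interleaved with a fixed reshape/transpose permutation, which is what makes them hardware-efficient. The sketch below multiplies a vector by one such structured matrix in O(n^1.5) time; the block sizes and permutation convention follow the generic butterfly-style pattern and are assumptions, not the paper's exact parameterization.

```python
import numpy as np

def monarch_matvec(L_blocks, R_blocks, x):
    """Multiply x by a Monarch-style structured matrix.

    n = b * b, with two sets of b dense (b x b) blocks. The overall map is
    block-diagonal, permute, block-diagonal, which costs O(n^1.5) instead of
    O(n^2) for a dense matrix. The permutation convention is assumed here.
    """
    b = R_blocks.shape[0]
    z = x.reshape(b, b)                        # split input into b groups of size b
    z = np.einsum('bij,bj->bi', R_blocks, z)   # first block-diagonal factor
    z = z.T                                    # fixed "transpose" permutation
    z = np.einsum('bij,bj->bi', L_blocks, z)   # second block-diagonal factor
    return z.reshape(-1)

# Example: n = 16 with 4 blocks of size 4x4 on each side.
rng = np.random.default_rng(0)
L = rng.standard_normal((4, 4, 4))
R = rng.standard_normal((4, 4, 4))
y = monarch_matvec(L, R, rng.standard_normal(16))
```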
- A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z)
- Rescoring Sequence-to-Sequence Models for Text Line Recognition with CTC-Prefixes [0.0]
We propose to use the CTC-Prefix-Score during S2S decoding.
During beam search, paths that are invalid according to the CTC confidence matrix are penalised.
We evaluate this setup on three HTR data sets: IAM, Rimes, and StAZH.
arXiv Detail & Related papers (2021-10-12T11:40:05Z)
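The summary describes interpolating the S2S decoder score with a CTC prefix score and penalising prefixes the CTC confidence matrix rules out; the snippet below shows only that score-combination step. The weight lam and the helper ctc_prefix_logprob (which would implement the standard CTC prefix-probability recursion) are placeholders I am assuming, not the paper's code.

```python
def rescore_beam(hypotheses, ctc_prefix_logprob, lam=0.5, invalid=-1e9):
    """hypotheses: list of (token_ids, s2s_logprob) pairs from the S2S beam.
    ctc_prefix_logprob: assumed helper returning the CTC prefix log-probability
    of a token sequence under the CTC confidence matrix (-inf if the prefix
    cannot be aligned, which is how invalid paths get penalised).
    Returns the hypotheses sorted by the interpolated score."""
    rescored = []
    for tokens, s2s_lp in hypotheses:
        ctc_lp = ctc_prefix_logprob(tokens)
        if ctc_lp == float("-inf"):
            score = invalid                      # prune paths CTC says are impossible
        else:
            score = lam * s2s_lp + (1.0 - lam) * ctc_lp
        rescored.append((tokens, score))
    return sorted(rescored, key=lambda t: t[1], reverse=True)
```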
- An Information Theory-inspired Strategy for Automatic Network Pruning [88.51235160841377]
Deep convolutional neural networks typically need to be compressed before deployment on devices with resource constraints.
Most existing network pruning methods require laborious human effort and prohibitive computational resources.
We propose an information theory-inspired strategy for automatic model compression.
arXiv Detail & Related papers (2021-08-19T07:03:22Z)
- Layer-Wise Data-Free CNN Compression [49.73757297936685]
We show how to generate layer-wise training data using only a pretrained network.
We present results for layer-wise compression using quantization and pruning.
arXiv Detail & Related papers (2020-11-18T03:00:05Z)
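A layer-wise, data-free pipeline generally calibrates each compressed layer against the pretrained layer's own responses. The sketch below does this for a single per-tensor quantization scale using random inputs as a stand-in for the generated data the paper builds from the network itself; the clipping search and Gaussian inputs are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def calibrate_layer_quant(W, num_samples=512, bits=8, seed=0):
    """Pick a quantization scale for one layer by minimizing the error between
    the pretrained layer's outputs and the quantized layer's outputs on
    synthetic inputs. Random Gaussian inputs are an illustrative stand-in for
    the paper's generated, layer-wise training data."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((num_samples, W.shape[0])).astype(np.float32)
    ref = X @ W                                    # pretrained layer's response
    qmax = 2 ** (bits - 1) - 1
    best_scale, best_err = None, np.inf
    for frac in np.linspace(0.5, 1.0, 11):         # search a few clipping levels
        scale = max(frac * np.abs(W).max() / qmax, 1e-12)
        Wq = np.clip(np.round(W / scale), -qmax, qmax) * scale
        err = np.mean((X @ Wq - ref) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err
```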
- OctSqueeze: Octree-Structured Entropy Model for LiDAR Compression [77.8842824702423]
We present a novel deep compression algorithm to reduce the memory footprint of LiDAR point clouds.
Our method exploits the sparsity and structural redundancy between points to reduce the memory footprint.
Our algorithm can be used to reduce the onboard and offboard storage of LiDAR points for applications such as self-driving cars.
arXiv Detail & Related papers (2020-05-14T17:48:49Z)
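An octree makes the sparsity and structural redundancy of a LiDAR point cloud explicit: every node stores an 8-bit child-occupancy code, and an octree-structured entropy model predicts those codes for compression. The recursive serializer below is a generic octree builder under that assumption, not the paper's model; octree_codes and its cube parameterization are mine.

```python
import numpy as np

def octree_codes(points, center, half, depth):
    """Serialize a point cloud into per-node 8-bit occupancy codes.

    points: (n, 3) array inside the cube [center - half, center + half]^3.
    Returns the occupancy bytes in breadth-first order; compressing this
    stream with a learned entropy model is the part OctSqueeze adds.
    """
    codes = []
    queue = [(np.asarray(points, float), np.asarray(center, float), float(half), 0)]
    while queue:
        pts, c, h, d = queue.pop(0)
        if len(pts) == 0 or d >= depth:
            continue
        octant = ((pts > c) * np.array([1, 2, 4])).sum(axis=1)   # 0..7 per point
        code = 0
        for o in range(8):
            child_pts = pts[octant == o]
            if len(child_pts):
                code |= 1 << o
                offset = (np.array([o & 1, (o >> 1) & 1, (o >> 2) & 1]) - 0.5) * h
                queue.append((child_pts, c + offset, h / 2.0, d + 1))
        codes.append(code)
    return codes
```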