Block-wise Bit-Compression of Transformer-based Models
- URL: http://arxiv.org/abs/2303.09184v2
- Date: Sat, 1 Apr 2023 12:50:29 GMT
- Title: Block-wise Bit-Compression of Transformer-based Models
- Authors: Gaochen Dong, Wei Chen
- Abstract summary: We propose BBCT, a method of block-wise bit-compression for transformers without retraining.
Our benchmark results on the General Language Understanding Evaluation (GLUE) suite show that BBCT incurs less than a 1% accuracy drop on most tasks.
- Score: 9.77519365079468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Transformer-based models such as BERT, GPT-3, and ChatGPT have
achieved state-of-the-art performance across a range of natural language
processing tasks. However, their massive computation, large memory footprint,
and consequently high latency pose a challenge for cloud deployments with
strict real-time requirements. To tackle this issue, we propose BBCT, a method
of block-wise bit-compression for Transformers that requires no retraining.
Our method compresses the whole Transformer at a finer granularity, covering
the embedding, matrix multiplications, GELU, softmax, layer normalization, and
all intermediate results. As a case study, we compress an efficient BERT with
BBCT. Our benchmark results on the General Language Understanding Evaluation
(GLUE) suite show that BBCT incurs less than a 1% accuracy drop on most tasks.
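Below is a minimal NumPy sketch of the block-wise quantization idea the abstract describes: each tensor is split into fixed-size blocks that share one scale factor. The block size, bit width, and function names are illustrative assumptions, not the authors' implementation, and only a single tensor is handled rather than the full set of Transformer components.

```python
import numpy as np

def blockwise_quantize(x, block_size=64, bits=8):
    """Quantize a tensor block by block with one scale factor per block."""
    flat = x.reshape(-1).astype(np.float32)
    pad = (-flat.size) % block_size
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)

    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                       # guard all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales, x.shape, pad

def blockwise_dequantize(q, scales, shape, pad):
    """Reconstruct an approximate float tensor from the quantized blocks."""
    flat = (q.astype(np.float32) * scales).reshape(-1)
    return (flat[:-pad] if pad else flat).reshape(shape)

# Example: round-trip a weight matrix and check the reconstruction error.
w = np.random.randn(768, 3072).astype(np.float32)
w_hat = blockwise_dequantize(*blockwise_quantize(w))
print("max abs error:", np.abs(w - w_hat).max())
```

In the paper's setting this kind of per-block scaling would cover activations such as GELU and softmax outputs as well as weights; the sketch only shows the core quantize/dequantize round trip.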
Related papers
- STAT: Shrinking Transformers After Training [72.0726371426711]
We present STAT, a simple algorithm to prune transformer models without any fine-tuning.
STAT eliminates both attention heads and neurons from the network, while preserving accuracy by calculating a correction to the weights of the next layer.
Our entire algorithm takes minutes to compress BERT, and less than three hours to compress models with 7B parameters using a single GPU.
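As a rough illustration of pruning with a next-layer weight correction, the sketch below drops the least-active neurons and refits the next layer's weights on calibration activations by least squares. The calibration data, selection heuristic, and function names are assumptions for illustration, not STAT's actual procedure.

```python
import numpy as np

def prune_with_correction(H, W_next, keep):
    """
    H      : (n_samples, d) calibration activations feeding the next layer
    W_next : (d, d_out) next-layer weights
    keep   : boolean mask over the d neurons to keep
    Returns corrected weights so H[:, keep] @ W_corr approximates H @ W_next.
    """
    Y = H @ W_next                                   # original next-layer inputs
    W_corr, *_ = np.linalg.lstsq(H[:, keep], Y, rcond=None)
    return W_corr

rng = np.random.default_rng(0)
H = rng.standard_normal((1024, 256)).astype(np.float32)
W_next = rng.standard_normal((256, 64)).astype(np.float32)

# Drop the 32 least-active neurons (a simple stand-in importance score).
activity = np.abs(H).mean(axis=0)
keep = np.zeros(256, dtype=bool)
keep[np.argsort(activity)[32:]] = True

W_corr = prune_with_correction(H, W_next, keep)
err = np.abs(H @ W_next - H[:, keep] @ W_corr).max()
print("pruned 32/256 neurons, max output error:", err)
```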
arXiv Detail & Related papers (2024-05-29T22:59:11Z) - MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter-count-dominated tasks such as language modeling.
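To make the layer-sharing point concrete, here is a tiny PyTorch sketch in which a single encoder layer is applied repeatedly, so depth no longer multiplies the parameter count. The sizes and module choices are illustrative assumptions and do not reflect MoEUT's mixture-of-experts design.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Applies one Transformer layer n_steps times instead of stacking
    n_steps distinct layers, so the parameters are shared across depth."""

    def __init__(self, d_model=256, n_heads=4, n_steps=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_steps = n_steps

    def forward(self, x):
        for _ in range(self.n_steps):
            x = self.layer(x)            # same weights at every "depth"
        return x

model = SharedLayerEncoder()
x = torch.randn(2, 16, 256)              # (batch, sequence, d_model)
print(sum(p.numel() for p in model.parameters()), model(x).shape)
```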
arXiv Detail & Related papers (2024-05-25T03:24:32Z) - Shallow Cross-Encoders for Low-Latency Retrieval [69.06104373460597]
Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window.
We show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings.
arXiv Detail & Related papers (2024-03-29T15:07:21Z) - A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformers.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z) - Blockwise Compression of Transformer-based Models without Retraining [6.118476907408718]
We propose BCT, a framework of blockwise compression for transformers without retraining.
Unlike layerwise compression methods, BCT achieves finer compression of the entire transformer by operating blockwise.
BCT effectively compresses all components of the model, including but not limited to the embedding, matrix multiplication, GELU, Softmax, layer normalization, and intermediate results.
arXiv Detail & Related papers (2023-04-04T02:55:40Z) - ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs [6.9136984255301]
We present ByteTransformer, a high-performance transformer boosted for variable-length inputs.
ByteTransformer surpasses the state-of-the-art Transformer frameworks, such as PyTorch JIT, XLA, Tencent TurboTransformer and NVIDIA FasterTransformer.
arXiv Detail & Related papers (2022-10-06T16:57:23Z) - Sparse is Enough in Scaling Transformers [12.561317511514469]
Large Transformer models yield impressive results on many tasks, but are expensive to train or even fine-tune, and so slow at decoding that their use and study become out of reach.
We propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer.
arXiv Detail & Related papers (2021-11-24T19:53:46Z) - Prune Once for All: Sparse Pre-Trained Language Models [0.6063525456640462]
We present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation.
These sparse pre-trained models can be used for transfer learning to a wide range of tasks while maintaining their sparsity pattern.
We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss.
arXiv Detail & Related papers (2021-11-10T15:52:40Z) - Regularizing Transformers With Deep Probabilistic Layers [62.997667081978825]
In this work, we demonstrate how including deep generative models within BERT can yield more versatile models.
We show its effectiveness not only in Transformers but also in the most relevant encoder-decoder-based LM, seq2seq with and without attention.
arXiv Detail & Related papers (2021-08-23T10:17:02Z) - schuBERT: Optimizing Elements of BERT [22.463154358632472]
We revisit the architecture choices of BERT in efforts to obtain a lighter model.
We show that much more efficient, lighter BERT models can be obtained by reducing algorithmically chosen architecture design dimensions.
In particular, our schuBERT gives 6.6% higher average accuracy on the GLUE and SQuAD datasets compared to BERT with three encoder layers.
arXiv Detail & Related papers (2020-05-09T21:56:04Z) - The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
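The cascade-of-rankers idea in the last entry can be illustrated with a toy pipeline: cheaper scorers prune the candidate pool before the most expensive model runs. The scorers, keep-ratios, and documents below are made-up placeholders, not the paper's setup.

```python
from typing import Callable, List, Tuple

def cascade_rank(candidates: List[str],
                 scorers: List[Callable[[str], float]],
                 keep_ratios: List[float]) -> List[Tuple[str, float]]:
    """Score candidates stage by stage; after each (cheaper) stage keep only
    the top fraction, so the final (most expensive) scorer sees few inputs."""
    pool = candidates
    scored: List[Tuple[str, float]] = []
    for scorer, keep in zip(scorers, keep_ratios):
        scored = sorted(((c, scorer(c)) for c in pool),
                        key=lambda pair: pair[1], reverse=True)
        pool = [c for c, _ in scored[: max(1, int(len(scored) * keep))]]
    return scored[: len(pool)]

# Toy scorers: a cheap length heuristic, then a stand-in for the full model.
cheap = lambda s: -abs(len(s) - 45)
full = lambda s: float(s.lower().count("transformer"))
docs = ["a short sentence",
        "a transformer compresses transformer layers efficiently",
        "an unrelated sentence about weather patterns today"]
print(cascade_rank(docs, [cheap, full], [0.7, 1.0]))
```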
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.