Blockwise Compression of Transformer-based Models without Retraining
- URL: http://arxiv.org/abs/2304.01483v2
- Date: Sun, 17 Sep 2023 22:47:50 GMT
- Title: Blockwise Compression of Transformer-based Models without Retraining
- Authors: Gaochen Dong, Wei Chen
- Abstract summary: We propose BCT, a framework of blockwise compression for transformers without retraining.
Unlike layerwise compression methods, BCT achieves finer compression of the entire transformer by operating blockwise.
BCT effectively compresses all components of the model, including but not limited to the embedding, matrix multiplication, GELU, Softmax, layer normalization, and intermediate results.
- Score: 6.118476907408718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models, exemplified by GPT-3, ChatGPT, and GPT-4, have
recently garnered considerable attention in both academia and industry due to
their promising performance in general language tasks. Nevertheless, these
models typically involve computationally intensive encoding processes, and in
some cases decoding processes as well, both of which fundamentally consist of
large-scale matrix multiplications. These operations bring the inevitable challenges of massive
computation resources and huge memory footprint, usually requiring at least
10^23 FLOPs and hundreds of gigabytes, respectively. A common method to address
this issue is to reduce the computational and memory requirements by applying
layerwise quantization to the transformer, replacing the usual fp32 data type
with a low-bit equivalent. Unfortunately, this method often leads to decreased
model accuracy and necessitates time-consuming retraining. Such retraining not
only requires fine-tuning skills but also substantial computational resources,
posing challenges for users. To specifically tackle these issues, we propose
BCT, a framework of blockwise compression for transformers without retraining,
aiming to facilitate model deployment. Unlike layerwise compression methods,
BCT achieves finer compression of the entire transformer by operating
blockwise. This method mitigates data distribution deviation caused by
quantization, eliminating the requirement for retraining. BCT effectively
compresses all components of the model, including but not limited to the
embedding, matrix multiplication, GELU, Softmax, layer normalization, and
intermediate results. In a case study, BCT compresses an efficient model by up
to 7.988x. We then evaluate the compressed model on several General Language
Understanding Evaluation (GLUE) datasets.
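As an illustration of the blockwise idea, the sketch below quantizes a weight matrix with one scale per small block instead of one scale per layer, which is the intuition behind avoiding retraining: each low-bit grid follows its block's local data distribution. This is a minimal sketch, not the authors' BCT pipeline; the symmetric scheme, 64-value block size, and 4-bit width are assumptions chosen for illustration.

```python
import numpy as np

def quantize_blockwise(w: np.ndarray, block_size: int = 64, n_bits: int = 4):
    """Symmetric per-block quantization of a 2-D weight matrix (illustrative)."""
    rows, cols = w.shape
    q_max = 2 ** (n_bits - 1) - 1                     # e.g. 7 for 4-bit codes
    pad = (-cols) % block_size                        # pad so columns split evenly
    w_pad = np.pad(w, ((0, 0), (0, pad)))
    blocks = w_pad.reshape(rows, -1, block_size)      # (rows, n_blocks, block_size)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / q_max
    scales = np.where(scales == 0, 1.0, scales)       # avoid division by zero
    q = np.clip(np.round(blocks / scales), -q_max - 1, q_max).astype(np.int8)
    return q, scales, cols

def dequantize_blockwise(q, scales, cols):
    """Reconstruct an fp32 approximation from codes and per-block scales."""
    w_hat = (q.astype(np.float32) * scales).reshape(q.shape[0], -1)
    return w_hat[:, :cols]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(768, 768)).astype(np.float32)
    q, s, cols = quantize_blockwise(w)
    w_hat = dequantize_blockwise(q, s, cols)
    # Storage: 4-bit codes plus one fp32 scale per block, versus fp32 weights.
    ratio = (w.size * 32) / (q.size * 4 + s.size * 32)
    print("mean abs error: %.4f, compression ratio: %.2fx"
          % (np.abs(w - w_hat).mean(), ratio))
```

With 4-bit codes and one fp32 scale per 64-value block, this sketch stores roughly 4.5 bits per weight, i.e. about 7x smaller than fp32, which is in the same ballpark as the 7.988x figure reported above.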
Related papers
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter-count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- A Survey on Transformer Compression [84.18094368700379]
Transformers play a vital role in the realms of natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z)
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference (a minimal TopK sketch appears after this list).
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
- Block-wise Bit-Compression of Transformer-based Models [9.77519365079468]
We propose BBCT, a method of block-wise bit-compression for transformers without retraining.
Our benchmark test results on General Language Understanding Evaluation (GLUE) show that BBCT can achieve less than 1% accuracy drop in most tasks.
arXiv Detail & Related papers (2023-03-16T09:53:57Z)
- Knowledge Distillation in Vision Transformers: A Critical Review [6.508088032296086]
Vision Transformers (ViTs) have demonstrated impressive performance improvements over Convolutional Neural Networks (CNNs).
Model compression has recently attracted considerable research attention as a potential remedy.
This paper discusses various approaches based upon KD for effective compression of ViT models.
arXiv Detail & Related papers (2023-02-04T06:30:57Z)
- Prune Once for All: Sparse Pre-Trained Language Models [0.6063525456640462]
We present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation.
These sparse pre-trained models can be used to transfer learning for a wide range of tasks while maintaining their sparsity pattern.
We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss.
arXiv Detail & Related papers (2021-11-10T15:52:40Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least 100x and 20x during data parallelism (DP) and model parallelism (MP), respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
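The TopK compression mentioned in the activations-and-gradients entry above can be sketched in a few lines. This is an illustrative magnitude-based variant with a fixed keep ratio; the paper's thresholding, error feedback, and distributed communication details are not reproduced, and the function names are hypothetical.

```python
import numpy as np

def topk_compress(x: np.ndarray, k_ratio: float = 0.01):
    """Keep only the largest-magnitude entries; return their indices and values."""
    flat = x.ravel()
    k = max(1, int(k_ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest magnitudes
    return idx, flat[idx], x.shape

def topk_decompress(idx, vals, shape):
    """Scatter the kept values back into a dense zero tensor."""
    out = np.zeros(int(np.prod(shape)), dtype=vals.dtype)
    out[idx] = vals
    return out.reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.normal(size=(1024, 1024)).astype(np.float32)
    idx, vals, shape = topk_compress(grad, k_ratio=0.01)   # transmit ~1% of entries
    grad_hat = topk_decompress(idx, vals, shape)
    print("kept entries:", vals.size, "of", grad.size)
```

If activations are compressed this way, the same compress/decompress pair would also have to run at inference time, which matches the observation above that TopK-trained models only perform well when compression is applied during inference as well.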