Compression of Generative Pre-trained Language Models via Quantization
- URL: http://arxiv.org/abs/2203.10705v1
- Date: Mon, 21 Mar 2022 02:11:35 GMT
- Title: Compression of Generative Pre-trained Language Models via Quantization
- Authors: Chaofan Tao, Lu Hou, Wei Zhang, Lifeng Shang, Xin Jiang, Qun Liu, Ping Luo, Ngai Wong
- Abstract summary: We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
- Score: 62.80110048377957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing size of generative Pre-trained Language Models (PLMs) has
greatly increased the demand for model compression. Despite various methods to
compress BERT or its variants, there are few attempts to compress generative
PLMs, and the underlying difficulty remains unclear. In this paper, we compress
generative PLMs by quantization. We find that previous quantization methods
fail on generative tasks due to the \textit{homogeneous word embeddings} caused
by reduced capacity, and \textit{varied distribution of weights}.
Correspondingly, we propose a token-level contrastive distillation to learn
distinguishable word embeddings, and a module-wise dynamic scaling to make
quantizers adaptive to different modules. Empirical results on various tasks
show that our proposed method outperforms the state-of-the-art compression
methods on generative PLMs by a clear margin. With performance comparable to the
full-precision models, we achieve 14.4x and 13.4x compression rates on
GPT-2 and BART, respectively.
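The sketch below illustrates, under our own naming and hyperparameter assumptions, the two ingredients named in the abstract: a token-level contrastive distillation loss that pulls each quantized-student token representation toward the full-precision teacher's representation of the same token while pushing it away from the other tokens in the batch, and a symmetric uniform quantizer whose clipping scale is a learnable parameter instantiated separately per module. It is a minimal reading of the abstract, not the authors' implementation.

```python
# Minimal sketch of token-level contrastive distillation and a per-module
# learnable-scale quantizer; all names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def token_level_contrastive_loss(student_h, teacher_h, temperature=0.1):
    """student_h, teacher_h: (num_tokens, hidden) representations of the same tokens."""
    s = F.normalize(student_h, dim=-1)
    t = F.normalize(teacher_h, dim=-1)
    logits = s @ t.t() / temperature                    # similarity of each student token to every teacher token
    labels = torch.arange(s.size(0), device=s.device)   # the matching teacher token is the positive
    return F.cross_entropy(logits, labels)

class ScaledQuantizer(torch.nn.Module):
    """Symmetric uniform quantizer whose clipping scale is learned (one instance per module)."""
    def __init__(self, bits=2, init_scale=1.0):
        super().__init__()
        self.levels = 2 ** (bits - 1) - 1
        self.scale = torch.nn.Parameter(torch.tensor(init_scale))  # adapts to this module's weight range

    def forward(self, w):
        w_clipped = torch.clamp(w / self.scale, -1.0, 1.0)
        w_q = torch.round(w_clipped * self.levels) / self.levels
        # straight-through estimator: quantized values forward, identity gradient backward
        return (w_q - w_clipped).detach() * self.scale + w_clipped * self.scale
```

In a full training loop, one `ScaledQuantizer` would wrap each weight matrix (word embeddings, attention, feed-forward), and the contrastive loss would be added to the usual distillation objectives.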
Related papers
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
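A rough, data-free sketch of the recipe described in the entry above, with all sizes and the grid search chosen for illustration: rotate a weight matrix with an orthonormal Hadamard transform so its entries become approximately Gaussian, then round to a small uniform grid whose scale is picked by a crude MSE search (a stand-in for the paper's MSE-optimal grids).

```python
# Illustrative Hadamard-rotation + grid-quantization sketch; not the HIGGS code.
import numpy as np
from scipy.linalg import hadamard

def hadamard_rotate(w):
    """w: (out, in) with `in` a power of two; returns rotated weights and the rotation."""
    n = w.shape[1]
    h = hadamard(n) / np.sqrt(n)   # orthonormal Hadamard matrix
    return w @ h, h                # apply h.T at inference time to undo the rotation

def quantize_to_grid(w, bits=4):
    """Uniform symmetric grid; the scale is picked by a small MSE search."""
    levels = 2 ** (bits - 1) - 1
    best = None
    for mult in np.linspace(0.5, 1.5, 21):             # candidate clipping scales
        scale = mult * np.abs(w).max() / levels
        w_q = np.clip(np.round(w / scale), -levels, levels) * scale
        err = np.mean((w - w_q) ** 2)
        if best is None or err < best[0]:
            best = (err, w_q)
    return best[1]

w = np.random.randn(64, 64).astype(np.float32)
w_rot, h = hadamard_rotate(w)
w_q = quantize_to_grid(w_rot)
print("relative MSE:", np.mean((w_rot - w_q) ** 2) / np.mean(w_rot ** 2))
```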
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
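To make the "matrix pairs" idea in the entry above concrete, here is a toy example of jointly re-factorizing a coupled query/key pair at a smaller inner dimension via a truncated SVD of their product; it is an illustrative stand-in under our own assumptions, not MoDeGPT's actual module-wise decompositions.

```python
# Toy joint compression of a coupled weight pair: attention scores depend only on
# w_q @ w_k.T, so the pair can be re-factorized at a smaller inner dimension.
import numpy as np

def compress_qk_pair(w_q, w_k, rank):
    """w_q, w_k: (d_model, d_head). Returns (w_q_new, w_k_new) of shape (d_model, rank)
    such that w_q_new @ w_k_new.T approximates w_q @ w_k.T."""
    u, s, vt = np.linalg.svd(w_q @ w_k.T, full_matrices=False)
    w_q_new = u[:, :rank] * np.sqrt(s[:rank])
    w_k_new = vt[:rank].T * np.sqrt(s[:rank])
    return w_q_new, w_k_new

d_model, d_head, rank = 128, 64, 16
w_q = np.random.randn(d_model, d_head)
w_k = np.random.randn(d_model, d_head)
w_q_new, w_k_new = compress_qk_pair(w_q, w_k, rank)
err = np.linalg.norm(w_q @ w_k.T - w_q_new @ w_k_new.T) / np.linalg.norm(w_q @ w_k.T)
print(f"relative reconstruction error at rank {rank}: {err:.3f}")
```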
- LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models [22.06402870816756]
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities.
This paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity.
We show that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss.
arXiv Detail & Related papers (2023-10-09T14:10:21Z)
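A deliberately crude sketch of budgeted prompt compression in the spirit of the entry above: score each token with a proxy for informativeness and keep the highest-scoring tokens until a token budget is met. The real method relies on a small language model's perplexities and a coarse-to-fine, iterative procedure; the unigram self-information score and `budget_ratio` below are placeholders.

```python
# Placeholder budget-controlled prompt compression (not LLMLingua's algorithm).
import math
from collections import Counter

def compress_prompt(tokens, budget_ratio=0.5):
    counts = Counter(tokens)
    total = len(tokens)
    # self-information of each token under a unigram model (placeholder informativeness score)
    scores = [-math.log(counts[t] / total) for t in tokens]
    budget = max(1, int(budget_ratio * total))
    keep = sorted(range(total), key=lambda i: scores[i], reverse=True)[:budget]
    keep = sorted(keep)                      # preserve the original token order
    return [tokens[i] for i in keep]

prompt = "the quick brown fox jumps over the lazy dog and the quick dog naps".split()
print(compress_prompt(prompt, budget_ratio=0.5))
```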
- Just CHOP: Embarrassingly Simple LLM Compression [27.64461490974072]
Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint.
We show that simple layer pruning coupled with an extended language model pretraining produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale.
We also show how distillation, which has been highly effective in task-agnostic compression of smaller BERT-style models, becomes inefficient against our simple pruning technique.
arXiv Detail & Related papers (2023-05-24T08:18:35Z)
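The recipe in the entry above amounts to deleting layers and then continuing pretraining. A hedged sketch using the Hugging Face `transformers` API on GPT-2 follows; the choice of which layers to keep is our own assumption.

```python
# Drop a fraction of Transformer blocks, then continue language-model pretraining.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
keep_every = 2                                   # keep every 2nd block -> roughly half the layers
blocks = model.transformer.h
model.transformer.h = torch.nn.ModuleList(
    [blocks[i] for i in range(len(blocks)) if i % keep_every == 0]
)
model.config.n_layer = len(model.transformer.h)
print(f"pruned to {model.config.n_layer} layers; "
      f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
# ...followed by standard (extended) language-model pretraining on raw text.
```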
- Revisiting Offline Compression: Going Beyond Factorization-based Methods for Transformer Language Models [7.542276054279341]
Transformer language models achieve outstanding results in many natural language processing (NLP) tasks.
Their enormous size often makes them impractical on memory-constrained devices, requiring practitioners to compress them to smaller networks.
In this paper, we explore offline compression methods, meaning computationally-cheap approaches that do not require further fine-tuning of the compressed model.
arXiv Detail & Related papers (2023-02-08T13:36:06Z)
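As a point of reference for the factorization-based baselines the entry above revisits, the snippet below shows plain truncated-SVD offline compression of a single weight matrix with no further fine-tuning; the shapes and rank are arbitrary.

```python
# Truncated-SVD factorization of one weight matrix (the classic offline baseline).
import numpy as np

def low_rank_factorize(w, rank):
    """w: (out, in) -> (a, b) with a: (out, rank), b: (rank, in), a @ b ~= w."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]
    b = vt[:rank]
    return a, b

w = np.random.randn(768, 3072).astype(np.float32)
a, b = low_rank_factorize(w, rank=128)
original, compressed = w.size, a.size + b.size
print(f"parameters: {original} -> {compressed} ({original / compressed:.1f}x smaller)")
```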
- The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models [23.12519490211362]
This paper studies the accuracy-compression trade-off for unstructured weight pruning in the context of BERT models.
We introduce Optimal BERT Surgeon (O-BERT-S), an efficient and accurate weight pruning method based on approximate second-order information.
We investigate the impact of this pruning method when compounding compression approaches for Transformer-based models.
arXiv Detail & Related papers (2022-03-14T16:40:31Z)
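A simplified sketch of second-order saliency pruning in the spirit of the entry above: approximate the Hessian by a diagonal empirical Fisher (mean squared per-sample gradient) and prune the weights with the smallest estimated loss increase. The paper's method uses richer block-wise second-order information; this diagonal version is only illustrative.

```python
# Diagonal-Fisher approximation to second-order (OBS-style) weight saliency.
import torch

def second_order_prune_mask(weight, grad_samples, sparsity=0.9):
    """weight: (n,) flattened weights; grad_samples: (s, n) per-sample gradients."""
    fisher_diag = grad_samples.pow(2).mean(dim=0)        # diagonal empirical Fisher
    saliency = 0.5 * weight.pow(2) * fisher_diag         # estimated loss increase if the weight is removed
    k = int(sparsity * weight.numel())                   # number of weights to prune
    prune_idx = torch.argsort(saliency)[:k]
    mask = torch.ones_like(weight)
    mask[prune_idx] = 0.0
    return mask

w = torch.randn(10_000)
grads = torch.randn(32, 10_000)                          # stand-in for per-sample gradients
mask = second_order_prune_mask(w, grads, sparsity=0.9)
print("kept fraction:", mask.mean().item())
```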
- Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
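One way to picture the joint subgroup-wise quantization and pruning in the entry above: split a weight matrix into groups, estimate each group's sensitivity, and assign bit-widths from {0, 2, 4, 8} under an average-bit budget, with 0 bits meaning the group is pruned. The greedy allocation and first-order sensitivity proxy below are our own stand-ins for the paper's automatic search.

```python
# Greedy subgroup-wise bit allocation under an average-bit budget (illustrative only).
import torch

def assign_bits(weight, grad, group_size=64, avg_bits=4.0, choices=(0, 2, 4, 8)):
    groups = weight.reshape(-1, group_size)
    sens = (groups * grad.reshape(-1, group_size)).abs().sum(dim=1)   # first-order sensitivity proxy
    order = torch.argsort(sens, descending=True)                      # most sensitive groups first
    budget = avg_bits * groups.size(0)
    bits = torch.zeros(groups.size(0))
    for idx in order:                        # give high precision to sensitive groups while budget lasts
        for b in sorted(choices, reverse=True):
            if bits.sum() + b <= budget:
                bits[idx] = b
                break
    return bits

w = torch.randn(768 * 768)
g = torch.randn(768 * 768)
bits = assign_bits(w, g)
print("average bits:", bits.mean().item(), "| pruned groups:", (bits == 0).sum().item())
```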
- What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression [68.82486784654817]
We study two popular model compression techniques including knowledge distillation and pruning.
We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets.
We develop a regularization strategy for model compression based on sample uncertainty.
arXiv Detail & Related papers (2021-10-16T00:20:04Z)
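As a hedged illustration of folding sample uncertainty into compression training (the exact regularizer in the entry above may differ), the loss below upweights examples on which the full-size model is uncertain, as measured by its predictive entropy:

```python
# Uncertainty-weighted training loss: a stand-in for a sample-uncertainty regularizer.
import torch
import torch.nn.functional as F

def uncertainty_weighted_loss(student_logits, teacher_logits, labels):
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    entropy = -(teacher_probs * teacher_probs.clamp_min(1e-12).log()).sum(dim=-1)
    weights = entropy / entropy.mean()                       # normalize weights around 1
    per_sample = F.cross_entropy(student_logits, labels, reduction="none")
    return (weights * per_sample).mean()

student = torch.randn(8, 3)
teacher = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(uncertainty_weighted_loss(student, teacher, labels))
```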
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
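The core trick in the entry above is easy to state in a few lines: in each training step, only a random subset of weights passes through the (non-differentiable) quantizer, so the remaining weights receive unbiased gradients. The sketch below uses simple int8-style uniform quantization; the paper extends the idea to far more extreme compression schemes.

```python
# Quantization-noise sketch: quantize a random subset of weights per step, with a
# straight-through estimator on that subset.
import torch

def quant_noise(weight, p=0.5, bits=8):
    levels = 2 ** (bits - 1) - 1
    scale = weight.abs().max() / levels
    w_q = torch.clamp(torch.round(weight / scale), -levels, levels) * scale
    noise_mask = (torch.rand_like(weight) < p).float()     # which weights are quantized this step
    # forward: quantized on the masked subset; backward: identity gradient everywhere
    return weight + noise_mask * (w_q - weight).detach()

w = torch.randn(4, 4, requires_grad=True)
quant_noise(w, p=0.5).sum().backward()
print(w.grad)                                              # gradients still reach all weights
```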
This list is automatically generated from the titles and abstracts of the papers on this site.