Exploring Extreme Parameter Compression for Pre-trained Language Models
- URL: http://arxiv.org/abs/2205.10036v1
- Date: Fri, 20 May 2022 09:16:55 GMT
- Title: Exploring Extreme Parameter Compression for Pre-trained Language Models
- Authors: Yuxin Ren, Benyou Wang, Lifeng Shang, Xin Jiang, Qun Liu
- Abstract summary: This work explores larger compression ratios for pre-trained language models (PLMs)
Two decomposition and reconstruction protocols are proposed to improve the effectiveness and efficiency during compression.
A tiny version achieves $96.7%$ performance of BERT-base with $ 1/48 $ encoder parameters and $2.7 times$ faster on inference.
- Score: 45.80044281531393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work explored the potential of large-scale Transformer-based
pre-trained models, especially Pre-trained Language Models (PLMs) in natural
language processing. This raises many concerns from various perspectives, e.g.,
financial costs and carbon emissions. Compressing PLMs like BERT with
negligible performance loss for faster inference and cheaper deployment has
attracted much attention. In this work, we aim to explore larger compression
ratios for PLMs, among which tensor decomposition is a potential but
under-investigated one. Two decomposition and reconstruction protocols are
further proposed to improve the effectiveness and efficiency during
compression. Our compressed BERT with ${1}/{7}$ parameters in Transformer
layers performs on-par with, sometimes slightly better than the original BERT
in GLUE benchmark. A tiny version achieves $96.7\%$ performance of BERT-base
with $ {1}/{48} $ encoder parameters (i.e., less than 2M parameters excluding
the embedding layer) and $2.7 \times$ faster on inference. To show that the
proposed method is orthogonal to existing compression methods like knowledge
distillation, we also explore the benefit of the proposed method on a distilled
BERT.
Related papers
- MoE-I$^2$: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [32.97035551579975]
We introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost.
Experiments on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8$times$7B demonstrate that our proposed methods can both reduce the model size and enhance inference efficiency.
arXiv Detail & Related papers (2024-11-01T20:37:58Z) - Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the textithomogeneous word embeddings
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z) - The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for
Large Language Models [23.12519490211362]
This paper studies the accuracy-compression trade-off for unstructured weight pruning in the context of BERT models.
We introduce Optimal BERT Surgeon (O-BERT-S), an efficient and accurate weight pruning method based on approximate second-order information.
We investigate the impact of this pruning method when compounding compression approaches for Transformer-based models.
arXiv Detail & Related papers (2022-03-14T16:40:31Z) - BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate fully binarized BERT to eliminate the performance bottlenecks.
Our method yields impressive 56.3 times and 31.2 times saving on FLOPs and model size.
arXiv Detail & Related papers (2022-03-12T09:46:13Z) - CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z) - ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques [10.983311133796745]
Pre-trained language models of the BERT family have defined the state-of-the-arts in a wide range of NLP tasks.
Performance of BERT-based models is mainly driven by the enormous amount of parameters, which hinders their application to resource-limited scenarios.
We introduce three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation) and explore a range of designs concerning model architecture.
Our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is $7.5 times$ smaller than
arXiv Detail & Related papers (2021-03-21T11:33:33Z) - TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
arXiv Detail & Related papers (2020-09-27T10:17:28Z) - AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural
Architecture Search [79.98686989604164]
Existing methods compress BERT into small models while such compression is task-independent, i.e., the same compressed BERT for all different downstream tasks.
We propose a novel compression method, AdaBERT, that leverages differentiable Neural Architecture Search to automatically compress BERT into task-adaptive small models for specific tasks.
We evaluate AdaBERT on several NLP tasks, and the results demonstrate that those task-adaptive compressed models are 12.7x to 29.3x faster than BERT in inference time and 11.5x to 17.0x smaller in terms of parameter size.
arXiv Detail & Related papers (2020-01-13T14:03:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.