Related papers: Exploring Extreme Parameter Compression for Pre-trained Language Models

Exploring Extreme Parameter Compression for Pre-trained Language Models

URL: http://arxiv.org/abs/2205.10036v1
Date: Fri, 20 May 2022 09:16:55 GMT
Title: Exploring Extreme Parameter Compression for Pre-trained Language Models
Authors: Yuxin Ren, Benyou Wang, Lifeng Shang, Xin Jiang, Qun Liu
Abstract summary: This work explores larger compression ratios for pre-trained language models (PLMs) Two decomposition and reconstruction protocols are proposed to improve the effectiveness and efficiency during compression. A tiny version achieves $96.7%$ performance of BERT-base with $ 1/48 $ encoder parameters and $2.7 times$ faster on inference.
Score: 45.80044281531393
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent work explored the potential of large-scale Transformer-based pre-trained models, especially Pre-trained Language Models (PLMs) in natural language processing. This raises many concerns from various perspectives, e.g., financial costs and carbon emissions. Compressing PLMs like BERT with negligible performance loss for faster inference and cheaper deployment has attracted much attention. In this work, we aim to explore larger compression ratios for PLMs, among which tensor decomposition is a potential but under-investigated one. Two decomposition and reconstruction protocols are further proposed to improve the effectiveness and efficiency during compression. Our compressed BERT with ${1}/{7}$ parameters in Transformer layers performs on-par with, sometimes slightly better than the original BERT in GLUE benchmark. A tiny version achieves $96.7\%$ performance of BERT-base with $ {1}/{48} $ encoder parameters (i.e., less than 2M parameters excluding the embedding layer) and $2.7 \times$ faster on inference. To show that the proposed method is orthogonal to existing compression methods like knowledge distillation, we also explore the benefit of the proposed method on a distilled BERT.

Related papers

PGB: One-Shot Pruning for BERT via Weight Grouping and Permutation [5.888489927450056]
This paper proposes a novel semi-structured one-shot pruning method for BERT, called $textitPermutation and Grouping for BERT$ (PGB) PGB identifies important groups of individual weights by permutation and prunes all other weights as a structure in both multi-head attention and feed-forward layers. Our experimental results on BERT$_textBASE$ demonstrate that PGB outperforms the state-of-the-art structured pruning methods in terms of computational cost and accuracy preservation.
arXiv Detail & Related papers (2025-02-06T11:34:41Z)
MoE-I$^2$: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [32.97035551579975]
We introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost. Experiments on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8$times$7B demonstrate that our proposed methods can both reduce the model size and enhance inference efficiency.
arXiv Detail & Related papers (2024-11-01T20:37:58Z)
Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the textithomogeneous word embeddings We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models [23.12519490211362]
This paper studies the accuracy-compression trade-off for unstructured weight pruning in the context of BERT models. We introduce Optimal BERT Surgeon (O-BERT-S), an efficient and accurate weight pruning method based on approximate second-order information. We investigate the impact of this pruning method when compounding compression approaches for Transformer-based models.
arXiv Detail & Related papers (2022-03-14T16:40:31Z)
BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate fully binarized BERT to eliminate the performance bottlenecks. Our method yields impressive 56.3 times and 31.2 times saving on FLOPs and model size.
arXiv Detail & Related papers (2022-03-12T09:46:13Z)
CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference. We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch. We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques [10.983311133796745]
Pre-trained language models of the BERT family have defined the state-of-the-arts in a wide range of NLP tasks. Performance of BERT-based models is mainly driven by the enormous amount of parameters, which hinders their application to resource-limited scenarios. We introduce three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation) and explore a range of designs concerning model architecture. Our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is $7.5 times$ smaller than
arXiv Detail & Related papers (2021-03-21T11:33:33Z)
TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model. Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
arXiv Detail & Related papers (2020-09-27T10:17:28Z)
AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search [79.98686989604164]
Existing methods compress BERT into small models while such compression is task-independent, i.e., the same compressed BERT for all different downstream tasks. We propose a novel compression method, AdaBERT, that leverages differentiable Neural Architecture Search to automatically compress BERT into task-adaptive small models for specific tasks. We evaluate AdaBERT on several NLP tasks, and the results demonstrate that those task-adaptive compressed models are 12.7x to 29.3x faster than BERT in inference time and 11.5x to 17.0x smaller in terms of parameter size.
arXiv Detail & Related papers (2020-01-13T14:03:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.