BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
- URL: http://arxiv.org/abs/2002.02925v4
- Date: Sat, 3 Oct 2020 12:18:50 GMT
- Title: BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
- Authors: Canwen Xu and Wangchunshu Zhou and Tao Ge and Furu Wei and Ming Zhou
- Abstract summary: Our approach first divides the original BERT into several modules and builds their compact substitutes.
We randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules.
- Score: 113.48041857222431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel model compression approach to effectively
compress BERT by progressive module replacing. Our approach first divides the
original BERT into several modules and builds their compact substitutes. Then,
we randomly replace the original modules with their substitutes to train the
compact modules to mimic the behavior of the original modules. We progressively
increase the probability of replacement through the training. In this way, our
approach brings a deeper level of interaction between the original and compact
models. Compared to the previous knowledge distillation approaches for BERT
compression, our approach does not introduce any additional loss function. Our
approach outperforms existing knowledge distillation approaches on the GLUE
benchmark, showing a new perspective of model compression.
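
To make the procedure concrete, here is a minimal PyTorch sketch of progressive module replacement (an illustration under assumed module sizes, a simple linear replacement schedule, and a toy regression task, not the authors' released implementation): each original module is paired with a compact successor, an independent coin flip per module decides which of the two runs in a given forward pass, and only the ordinary task loss is optimized.

```python
import torch
import torch.nn as nn

class TheseusBlock(nn.Module):
    """Pair a frozen original module (predecessor) with its compact substitute
    (successor). During training, a coin flip with probability p routes the
    forward pass through the successor; otherwise the predecessor is used."""

    def __init__(self, predecessor: nn.Module, successor: nn.Module):
        super().__init__()
        self.predecessor = predecessor
        self.successor = successor
        self.p = 0.0  # replacement probability, set by the schedule below
        for param in self.predecessor.parameters():
            param.requires_grad = False  # the original module stays fixed

    def forward(self, x):
        if self.training and torch.rand(()).item() >= self.p:
            return self.predecessor(x)
        return self.successor(x)

# Toy setup: two two-layer predecessor modules, each replaced by a one-layer
# successor; sizes, schedule, and the regression task are illustrative only.
hidden = 16
blocks = [
    TheseusBlock(
        predecessor=nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU()),
        successor=nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()),
    )
    for _ in range(2)
]
head = nn.Linear(hidden, hidden)  # task-specific head, always trainable
model = nn.Sequential(*blocks, head)

trainable = [q for b in blocks for q in b.successor.parameters()]
trainable += list(head.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-3)

total_steps, base_p = 1000, 0.3
for step in range(total_steps):
    # Raise the replacement probability linearly toward 1 over training.
    p = min(1.0, base_p + (1.0 - base_p) * step / total_steps)
    for block in blocks:
        block.p = p
    x, y = torch.randn(8, hidden), torch.randn(8, hidden)
    loss = nn.functional.mse_loss(model(x), y)  # plain task loss, no extra KD term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

compressed = nn.Sequential(*[b.successor for b in blocks], head)  # final compact model
```

After training, only the successors and the task head are retained as the compressed model; at no point is an additional distillation loss introduced.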
Related papers
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
- Is Modularity Transferable? A Case Study through the Lens of Knowledge Distillation [59.37775534633868]
We present an extremely straightforward approach to transferring pre-trained, task-specific PEFT modules between same-family PLMs.
We also propose a method that allows the transfer of modules between incompatible PLMs without any change in the inference complexity.
arXiv Detail & Related papers (2024-03-27T17:50:00Z)
- EELBERT: Tiny Models through Dynamic Embeddings [0.28675177318965045]
EELBERT is an approach for compressing transformer-based models (e.g., BERT).
The compression is achieved by replacing the model's input embedding layer with dynamic, i.e. on-the-fly, embedding computations.
We develop our smallest model UNO-EELBERT, which achieves a GLUE score within 4% of fully trained BERT-tiny.
arXiv Detail & Related papers (2023-10-31T03:28:08Z)
- Module-wise Adaptive Distillation for Multimodality Foundation Models [125.42414892566843]
Multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes.
One effective approach to reducing their sizes is layerwise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer.
Motivated by the observation that certain architecture components, referred to as modules, contribute more to the student's performance than others, we propose to track each module's contribution by recording the loss decrement after distilling it, and to distill the modules with greater contributions more frequently.
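
A toy version of this selection loop might look like the sketch below; `eval_loss` and `distill_step` are hypothetical placeholder callables, and the epsilon-greedy rule merely stands in for whatever selection strategy the paper actually uses.

```python
import random

def adaptive_module_distillation(modules, eval_loss, distill_step,
                                 steps=1000, epsilon=0.1, momentum=0.9):
    """Illustrative scheduler: distill one module at a time, record the loss
    decrement it produces, and prefer modules whose recent decrements are larger.
    `modules` is a list of module identifiers, `eval_loss()` returns the current
    distillation loss, and `distill_step(m)` runs one distillation update on m."""
    decrement = {m: 0.0 for m in modules}  # running contribution estimates
    for _ in range(steps):
        if random.random() < epsilon:
            m = random.choice(modules)                    # occasional exploration
        else:
            m = max(modules, key=lambda k: decrement[k])  # pick the top contributor
        before = eval_loss()
        distill_step(m)                                   # distill only this module
        after = eval_loss()
        # Exponential moving average of the observed loss decrement.
        decrement[m] = momentum * decrement[m] + (1 - momentum) * (before - after)
    return decrement
```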
arXiv Detail & Related papers (2023-10-06T19:24:00Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- Weakly Supervised Semantic Segmentation via Alternative Self-Dual Teaching [82.71578668091914]
This paper establishes a compact learning framework that embeds the classification and mask-refinement components into a unified deep model.
We propose a novel alternative self-dual teaching (ASDT) mechanism to encourage high-quality knowledge interaction.
arXiv Detail & Related papers (2021-12-17T11:56:56Z)
- KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation [5.8287955127529365]
We push the limits of state-of-the-art Transformer-based pre-trained language model compression using Kronecker decomposition.
We present our KroneckerBERT, a compressed version of the BERT_BASE model obtained using this framework.
Our experiments indicate that the proposed model has promising out-of-distribution robustness and is superior to the state-of-the-art compression methods on SQuAD.
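
As background on the building block, the sketch below shows a generic Kronecker-factored linear layer in PyTorch, i.e. a layer whose weight is W = A ⊗ B applied without ever materializing W. It illustrates the parameter saving only and is not KroneckerBERT's actual parameterization, initialization, or distillation setup.

```python
import torch
import torch.nn as nn

class KroneckerLinear(nn.Module):
    """Linear layer whose weight is the Kronecker product W = A ⊗ B.
    With A of shape (m1, n1) and B of shape (m2, n2), W is (m1*m2, n1*n2)
    but only m1*n1 + m2*n2 weight parameters are stored."""

    def __init__(self, m1, n1, m2, n2):
        super().__init__()
        self.A = nn.Parameter(torch.randn(m1, n1) * 0.02)
        self.B = nn.Parameter(torch.randn(m2, n2) * 0.02)
        self.bias = nn.Parameter(torch.zeros(m1 * m2))

    def forward(self, x):
        # Apply W = A ⊗ B to x without building W: reshape the input into an
        # (n1, n2) grid and contract with A and B separately.
        lead = x.shape[:-1]
        n1, n2 = self.A.shape[1], self.B.shape[1]
        grid = x.reshape(-1, n1, n2)                    # (batch, n1, n2)
        out = self.A @ grid @ self.B.transpose(0, 1)    # (batch, m1, m2)
        return out.reshape(*lead, -1) + self.bias

# A 768x768 projection stored with 24*24 + 32*32 = 1,600 weights instead of 589,824.
layer = KroneckerLinear(m1=24, n1=24, m2=32, n2=32)
print(layer(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```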
arXiv Detail & Related papers (2021-09-13T18:19:30Z)
- ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques [10.983311133796745]
Pre-trained language models of the BERT family have defined the state of the art in a wide range of NLP tasks.
Their performance is mainly driven by an enormous number of parameters, which hinders their application in resource-limited scenarios.
We introduce three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation) and explore a range of designs concerning model architecture.
Our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is $7.5\times$ smaller than
arXiv Detail & Related papers (2021-03-21T11:33:33Z)
- LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression [21.03685890385275]
BERT is a cutting-edge language representation model pre-trained on a large corpus.
However, it is memory-intensive and leads to unsatisfactory latency for user requests.
We propose a hybrid solution named LadaBERT, which combines the advantages of different model compression methods.
arXiv Detail & Related papers (2020-04-08T17:18:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.