BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
- URL: http://arxiv.org/abs/2002.02925v4
- Date: Sat, 3 Oct 2020 12:18:50 GMT
- Title: BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
- Authors: Canwen Xu and Wangchunshu Zhou and Tao Ge and Furu Wei and Ming Zhou
- Abstract summary: Our approach first divides the original BERT into several modules and builds their compact substitutes.
We randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules.
- Score: 113.48041857222431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel model compression approach to effectively
compress BERT by progressive module replacing. Our approach first divides the
original BERT into several modules and builds their compact substitutes. Then,
we randomly replace the original modules with their substitutes to train the
compact modules to mimic the behavior of the original modules. We progressively
increase the probability of replacement through the training. In this way, our
approach brings a deeper level of interaction between the original and compact
models. Compared to the previous knowledge distillation approaches for BERT
compression, our approach does not introduce any additional loss function. Our
approach outperforms existing knowledge distillation approaches on the GLUE
benchmark, showing a new perspective of model compression.
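
To make the procedure concrete, here is a minimal PyTorch sketch of progressive module replacement (an illustration under assumed module sizes, a simple linear replacement schedule, and a toy regression task, not the authors' released implementation): each original module is paired with a compact successor, an independent coin flip per module decides which of the two runs in a given forward pass, and only the ordinary task loss is optimized.

```python
import torch
import torch.nn as nn

class TheseusBlock(nn.Module):
    """Pair a frozen original module (predecessor) with its compact substitute
    (successor). During training, a coin flip with probability p routes the
    forward pass through the successor; otherwise the predecessor is used."""

    def __init__(self, predecessor: nn.Module, successor: nn.Module):
        super().__init__()
        self.predecessor = predecessor
        self.successor = successor
        self.p = 0.0  # replacement probability, set by the schedule below
        for param in self.predecessor.parameters():
            param.requires_grad = False  # the original module stays fixed

    def forward(self, x):
        if self.training and torch.rand(()).item() >= self.p:
            return self.predecessor(x)
        return self.successor(x)

# Toy setup: two two-layer predecessor modules, each replaced by a one-layer
# successor; sizes, schedule, and the regression task are illustrative only.
hidden = 16
blocks = [
    TheseusBlock(
        predecessor=nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU()),
        successor=nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()),
    )
    for _ in range(2)
]
head = nn.Linear(hidden, hidden)  # task-specific head, always trainable
model = nn.Sequential(*blocks, head)

trainable = [q for b in blocks for q in b.successor.parameters()]
trainable += list(head.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-3)

total_steps, base_p = 1000, 0.3
for step in range(total_steps):
    # Raise the replacement probability linearly toward 1 over training.
    p = min(1.0, base_p + (1.0 - base_p) * step / total_steps)
    for block in blocks:
        block.p = p
    x, y = torch.randn(8, hidden), torch.randn(8, hidden)
    loss = nn.functional.mse_loss(model(x), y)  # plain task loss, no extra KD term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

compressed = nn.Sequential(*[b.successor for b in blocks], head)  # final compact model
```

After training, only the successors and the task head are retained as the compressed model; at no point is an additional distillation loss introduced.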
Related papers
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
- Is Modularity Transferable? A Case Study through the Lens of Knowledge Distillation [59.37775534633868]
We present an extremely straightforward approach to transferring pre-trained, task-specific PEFT modules between same-family PLMs.
We also propose a method that allows the transfer of modules between incompatible PLMs without any change in the inference complexity.
arXiv Detail & Related papers (2024-03-27T17:50:00Z)
- EELBERT: Tiny Models through Dynamic Embeddings [0.28675177318965045]
EELBERT is an approach for compressing transformer-based models (e.g., BERT).
The compression is achieved by replacing the model's input embedding layer with dynamic, i.e. on-the-fly, embedding computations.
We develop our smallest model UNO-EELBERT, which achieves a GLUE score within 4% of fully trained BERT-tiny.
arXiv Detail & Related papers (2023-10-31T03:28:08Z)
- Module-wise Adaptive Distillation for Multimodality Foundation Models [125.42414892566843]
Multimodal foundation models have demonstrated remarkable generalizability but pose challenges for deployment due to their large sizes.
One effective approach to reducing their sizes is layerwise distillation, wherein small student models are trained to match the hidden representations of large teacher models at each layer.
Motivated by the observation that certain architecture components, referred to as modules, contribute more to the student's performance than others, we propose to track each module's contribution by recording the loss decrement after distilling it, and to distill the modules with greater contributions more frequently.
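
A toy version of this selection loop might look like the sketch below; `eval_loss` and `distill_step` are hypothetical placeholder callables, and the epsilon-greedy rule merely stands in for whatever selection strategy the paper actually uses.

```python
import random

def adaptive_module_distillation(modules, eval_loss, distill_step,
                                 steps=1000, epsilon=0.1, momentum=0.9):
    """Illustrative scheduler: distill one module at a time, record the loss
    decrement it produces, and prefer modules whose recent decrements are larger.
    `modules` is a list of module identifiers, `eval_loss()` returns the current
    distillation loss, and `distill_step(m)` runs one distillation update on m."""
    decrement = {m: 0.0 for m in modules}  # running contribution estimates
    for _ in range(steps):
        if random.random() < epsilon:
            m = random.choice(modules)                    # occasional exploration
        else:
            m = max(modules, key=lambda k: decrement[k])  # pick the top contributor
        before = eval_loss()
        distill_step(m)                                   # distill only this module
        after = eval_loss()
        # Exponential moving average of the observed loss decrement.
        decrement[m] = momentum * decrement[m] + (1 - momentum) * (before - after)
    return decrement
```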
arXiv Detail & Related papers (2023-10-06T19:24:00Z)
- Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
arXiv Detail & Related papers (2022-03-21T02:11:35Z)
- Weakly Supervised Semantic Segmentation via Alternative Self-Dual Teaching [82.71578668091914]
This paper establishes a compact learning framework that embeds the classification and mask-refinement components into a unified deep model.
We propose a novel alternative self-dual teaching (ASDT) mechanism to encourage high-quality knowledge interaction.
arXiv Detail & Related papers (2021-12-17T11:56:56Z)
- KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation [5.8287955127529365]
We push the limits of state-of-the-art Transformer-based pre-trained language model compression using Kronecker decomposition.
We present our KroneckerBERT, a compressed version of the BERT_BASE model obtained using this framework.
Our experiments indicate that the proposed model has promising out-of-distribution robustness and is superior to the state-of-the-art compression methods on SQuAD.
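
As background on the building block, the sketch below shows a generic Kronecker-factored linear layer in PyTorch, i.e. a layer whose weight is W = A ⊗ B applied without ever materializing W. It illustrates the parameter saving only and is not KroneckerBERT's actual parameterization, initialization, or distillation setup.

```python
import torch
import torch.nn as nn

class KroneckerLinear(nn.Module):
    """Linear layer whose weight is the Kronecker product W = A ⊗ B.
    With A of shape (m1, n1) and B of shape (m2, n2), W is (m1*m2, n1*n2)
    but only m1*n1 + m2*n2 weight parameters are stored."""

    def __init__(self, m1, n1, m2, n2):
        super().__init__()
        self.A = nn.Parameter(torch.randn(m1, n1) * 0.02)
        self.B = nn.Parameter(torch.randn(m2, n2) * 0.02)
        self.bias = nn.Parameter(torch.zeros(m1 * m2))

    def forward(self, x):
        # Apply W = A ⊗ B to x without building W: reshape the input into an
        # (n1, n2) grid and contract with A and B separately.
        lead = x.shape[:-1]
        n1, n2 = self.A.shape[1], self.B.shape[1]
        grid = x.reshape(-1, n1, n2)                    # (batch, n1, n2)
        out = self.A @ grid @ self.B.transpose(0, 1)    # (batch, m1, m2)
        return out.reshape(*lead, -1) + self.bias

# A 768x768 projection stored with 24*24 + 32*32 = 1,600 weights instead of 589,824.
layer = KroneckerLinear(m1=24, n1=24, m2=32, n2=32)
print(layer(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```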
arXiv Detail & Related papers (2021-09-13T18:19:30Z)
- ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques [10.983311133796745]
Pre-trained language models of the BERT family have defined the state of the art in a wide range of NLP tasks.
Their performance is mainly driven by an enormous number of parameters, which hinders their application in resource-limited scenarios.
We introduce three kinds of compression methods (weight pruning, low-rank factorization and knowledge distillation) and explore a range of designs concerning model architecture.
Our best compressed model, dubbed Refined BERT cOmpreSsion with InTegrAted techniques (ROSITA), is $7.5\times$ smaller than
arXiv Detail & Related papers (2021-03-21T11:33:33Z)
- LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression [21.03685890385275]
BERT is a cutting-edge language representation model pre-trained on a large corpus.
However, it is memory-intensive and leads to unsatisfactory latency for user requests.
We propose a hybrid solution named LadaBERT, which combines the advantages of different model compression methods.
arXiv Detail & Related papers (2020-04-08T17:18:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.