BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth
Mover's Distance
- URL: http://arxiv.org/abs/2010.06133v1
- Date: Tue, 13 Oct 2020 02:53:52 GMT
- Title: BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth
Mover's Distance
- Authors: Jianquan Li, Xiaokang Liu, Honghong Zhao, Ruifeng Xu, Min Yang and
Yaohong Jin
- Abstract summary: High storage and computational costs prevent pre-trained language models from being effectively deployed on resource-constrained devices.
We propose a novel BERT distillation method based on many-to-many layer mapping.
Our model can learn from different teacher layers adaptively for various NLP tasks.
- Score: 25.229624487344186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (e.g., BERT) have achieved significant success in
various natural language processing (NLP) tasks. However, high storage and
computational costs prevent pre-trained language models from being effectively
deployed on resource-constrained devices. In this paper, we propose a novel
BERT distillation method based on many-to-many layer mapping, which allows each
intermediate student layer to learn from any intermediate teacher layer. In
this way, our model can learn from different teacher layers adaptively for
various NLP tasks, motivated by the intuition that different NLP tasks require
different levels of the linguistic knowledge contained in the intermediate
layers of BERT. In addition, we leverage Earth Mover's Distance (EMD) to
compute the minimum cumulative cost that must be paid to transfer knowledge
from the teacher network to the student network. EMD can be applied to network
layers of different sizes, effectively measures the semantic distance between
the teacher and student networks, and thus enables effective matching for
many-to-many layer mapping. Furthermore, we propose a cost attention mechanism
that learns the layer weights used in EMD automatically, which is designed to
further improve the model's performance and accelerate convergence. Extensive
experiments on the GLUE benchmark demonstrate that our model achieves
competitive performance compared to strong competitors in terms of both
accuracy and model compression.
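
The abstract's recipe is concrete enough to sketch: measure a pairwise cost between every teacher layer and every student layer, solve a small transport problem to obtain the many-to-many mapping, and use the resulting plan to weight the intermediate-layer distillation loss. Below is a minimal sketch of that idea, not the authors' released implementation; the equal hidden sizes, the use of scipy's linprog to solve the transport program, and the cost_attention_weights helper (one possible reading of the cost attention mechanism) are all assumptions.

```python
# Minimal sketch of many-to-many layer mapping with Earth Mover's Distance.
# Function names, shapes, and the LP solver choice are illustrative assumptions.
import numpy as np
import torch
import torch.nn.functional as F
from scipy.optimize import linprog


def emd_layer_loss(teacher_hidden, student_hidden, teacher_w, student_w):
    """teacher_hidden / student_hidden: lists of [batch, seq, dim] tensors.
    teacher_w / student_w: 1-D numpy arrays of layer weights, each summing to 1.
    Returns (loss, flow): a differentiable loss and the T x S transport plan."""
    T, S = len(teacher_hidden), len(student_hidden)

    # Pairwise transfer cost: MSE between teacher layer i and student layer j
    # (hidden sizes are assumed equal here; a linear projection would be added
    # when the student is narrower than the teacher).
    cost = torch.stack([torch.stack([F.mse_loss(h_s, h_t) for h_s in student_hidden])
                        for h_t in teacher_hidden])                      # [T, S]

    # The EMD linear program: min <flow, cost> with flow >= 0, rows summing to
    # teacher_w and columns summing to student_w.
    c = cost.detach().cpu().numpy().flatten()
    A_eq, b_eq = [], []
    for i in range(T):                 # teacher layer i ships weight teacher_w[i]
        row = np.zeros(T * S); row[i * S:(i + 1) * S] = 1.0
        A_eq.append(row); b_eq.append(teacher_w[i])
    for j in range(S):                 # student layer j receives weight student_w[j]
        col = np.zeros(T * S); col[j::S] = 1.0
        A_eq.append(col); b_eq.append(student_w[j])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    flow = torch.tensor(res.x.reshape(T, S), dtype=cost.dtype)

    # The plan is treated as a constant; the weighted cost remains differentiable
    # with respect to the student, so it can be minimised by ordinary backprop.
    return (flow * cost).sum() / flow.sum(), flow


def cost_attention_weights(flow, cost, temperature=1.0):
    """Hypothetical cost-attention update: teacher layers whose transfers were
    cheap receive larger weights for the next EMD round. This is one plausible
    reading of the abstract's cost attention mechanism, not the paper's rule."""
    avg_cost = (flow * cost).sum(dim=1) / flow.sum(dim=1).clamp_min(1e-9)
    return torch.softmax(-avg_cost / temperature, dim=0).detach().cpu().numpy()
```

In training, the plan would be re-solved periodically as the student improves, and the weights returned by cost_attention_weights would replace uniform layer weights in the next round; both choices are illustrative rather than prescribed by the abstract.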
Related papers
- Keep Decoding Parallel with Effective Knowledge Distillation from
Language Models to End-to-end Speech Recognisers [19.812986973537143]
This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers.
Our method shows that language model (LM) information can be more effectively distilled into an ASR model using both the intermediate layers and the final layer.
Using our method, we achieve better recognition accuracy than with shallow fusion of an external LM, allowing us to maintain fast parallel decoding.
arXiv Detail & Related papers (2024-01-22T05:46:11Z) - Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation [29.952771954087602]
Temporal Sentence Grounding in Videos (TSGV) aims to detect the event timestamps described by the natural language query from untrimmed videos.
This paper discusses the challenge of achieving efficient computation in TSGV models while maintaining high performance.
arXiv Detail & Related papers (2023-08-07T17:07:48Z) - Masked Image Modeling with Local Multi-Scale Reconstruction [54.91442074100597]
Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning.
Existing MIM models conduct the reconstruction task only at the top layer of the encoder.
We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively.
arXiv Detail & Related papers (2023-03-09T13:42:04Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - One Teacher is Enough? Pre-trained Language Model Distillation from
Multiple Teachers [54.146208195806636]
We propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression.
We show that MT-BERT can train a high-quality student model from multiple teacher PLMs.
Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
arXiv Detail & Related papers (2021-06-02T08:42:33Z) - Energy-Efficient and Federated Meta-Learning via Projected Stochastic
Gradient Ascent [79.58680275615752]
We propose an energy-efficient federated meta-learning framework.
We assume each task is owned by a separate agent, so a limited number of tasks is used to train a meta-model.
arXiv Detail & Related papers (2021-05-31T08:15:44Z) - Knowledge Distillation By Sparse Representation Matching [107.87219371697063]
We propose Sparse Representation Matching (SRM) to transfer intermediate knowledge from one convolutional neural network (CNN) to another by utilizing sparse representation.
We formulate SRM as a neural processing block, which can be efficiently optimized using gradient descent and integrated into any CNN in a plug-and-play manner.
Our experiments demonstrate that SRM is robust to architectural differences between the teacher and student networks, and outperforms other KD techniques across several datasets.
arXiv Detail & Related papers (2021-03-31T11:47:47Z) - Cross-Layer Distillation with Semantic Calibration [26.59016826651437]
We propose Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD), which automatically assigns proper target layers of the teacher model to each student layer.
With a learned attention distribution, each student layer distills knowledge contained in multiple teacher layers rather than a single fixed intermediate layer, providing appropriate cross-layer supervision during training (a minimal sketch of this attention-based matching appears after this list).
arXiv Detail & Related papers (2020-12-06T11:16:07Z) - Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, truncated max-product belief propagation, and add what is necessary to make it a proper component of a deep learning model.
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs).
The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
arXiv Detail & Related papers (2020-03-13T13:11:35Z)
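
The SemCKD entry above describes each student layer attending over all teacher layers, the closest relative of BERT-EMD's many-to-many mapping in this list. The following is a minimal sketch of such attention-based cross-layer matching; the mean-pooled layer summaries, projection size, and all class and parameter names are illustrative assumptions rather than the SemCKD authors' exact formulation.

```python
# Sketch of a learned attention distribution over teacher layers for each
# student layer, in the spirit of SemCKD. Names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossLayerAttention(nn.Module):
    def __init__(self, student_dim, teacher_dim, proj_dim=128):
        super().__init__()
        self.query = nn.Linear(student_dim, proj_dim)      # student-side projection
        self.key = nn.Linear(teacher_dim, proj_dim)        # teacher-side projection
        self.align = nn.Linear(student_dim, teacher_dim)   # match feature sizes for the loss

    def forward(self, student_hidden, teacher_hidden):
        """student_hidden: list of [batch, seq, student_dim] tensors.
        teacher_hidden: list of [batch, seq, teacher_dim] tensors.
        Returns the attention-weighted cross-layer distillation loss."""
        # Mean-pool each layer over the sequence to get one summary vector per layer.
        s_pool = torch.stack([h.mean(dim=1) for h in student_hidden], dim=1)  # [B, S, Ds]
        t_pool = torch.stack([h.mean(dim=1) for h in teacher_hidden], dim=1)  # [B, T, Dt]

        # Attention of every student layer over all teacher layers.
        attn = torch.softmax(
            self.query(s_pool) @ self.key(t_pool).transpose(1, 2), dim=-1)    # [B, S, T]

        # Pairwise MSE between aligned student layer j and teacher layer i,
        # weighted by the learned attention.
        loss = 0.0
        for j, h_s in enumerate(student_hidden):
            h_s = self.align(h_s)
            for i, h_t in enumerate(teacher_hidden):
                pair = F.mse_loss(h_s, h_t, reduction="none").mean(dim=(1, 2))  # [B]
                loss = loss + (attn[:, j, i] * pair).mean()
        return loss
```

For instance, with a 4-layer student and a 12-layer teacher, one would pass the two lists of hidden states to this module and add the returned loss, scaled by a tuning coefficient, to the task loss.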
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.