BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth
Mover's Distance
- URL: http://arxiv.org/abs/2010.06133v1
- Date: Tue, 13 Oct 2020 02:53:52 GMT
- Title: BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth
Mover's Distance
- Authors: Jianquan Li, Xiaokang Liu, Honghong Zhao, Ruifeng Xu, Min Yang and
Yaohong Jin
- Abstract summary: High storage and computational costs prevent pre-trained language models from being effectively deployed on resource-constrained devices.
We propose a novel BERT distillation method based on many-to-many layer mapping.
Our model can learn from different teacher layers adaptively for various NLP tasks.
- Score: 25.229624487344186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models (e.g., BERT) have achieved significant success in
various natural language processing (NLP) tasks. However, high storage and
computational costs prevent pre-trained language models from being effectively
deployed on resource-constrained devices. In this paper, we propose a novel
BERT distillation method based on many-to-many layer mapping, which allows each
intermediate student layer to learn from any intermediate teacher layer. In
this way, our model can learn from different teacher layers adaptively for
various NLP tasks, motivated by the intuition that different NLP tasks require
different levels of the linguistic knowledge contained in the intermediate
layers of BERT. In addition, we leverage Earth Mover's Distance (EMD) to
compute the minimum cumulative cost that must be paid to transfer knowledge
from the teacher network to the student network. EMD can be applied to network
layers of different sizes, effectively measures the semantic distance between
the teacher and student networks, and thus enables effective matching for
many-to-many layer mapping. Furthermore, we propose a cost attention mechanism
that learns the layer weights used in EMD automatically, which is designed to
further improve the model's performance and accelerate convergence. Extensive
experiments on the GLUE benchmark demonstrate that our model achieves
competitive performance compared to strong competitors in terms of both
accuracy and model compression.
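
The abstract's recipe is concrete enough to sketch: measure a pairwise cost between every teacher layer and every student layer, solve a small transport problem to obtain the many-to-many mapping, and use the resulting plan to weight the intermediate-layer distillation loss. Below is a minimal sketch of that idea, not the authors' released implementation; the equal hidden sizes, the use of scipy's linprog to solve the transport program, and the cost_attention_weights helper (one possible reading of the cost attention mechanism) are all assumptions.

```python
# Minimal sketch of many-to-many layer mapping with Earth Mover's Distance.
# Function names, shapes, and the LP solver choice are illustrative assumptions.
import numpy as np
import torch
import torch.nn.functional as F
from scipy.optimize import linprog


def emd_layer_loss(teacher_hidden, student_hidden, teacher_w, student_w):
    """teacher_hidden / student_hidden: lists of [batch, seq, dim] tensors.
    teacher_w / student_w: 1-D numpy arrays of layer weights, each summing to 1.
    Returns (loss, flow): a differentiable loss and the T x S transport plan."""
    T, S = len(teacher_hidden), len(student_hidden)

    # Pairwise transfer cost: MSE between teacher layer i and student layer j
    # (hidden sizes are assumed equal here; a linear projection would be added
    # when the student is narrower than the teacher).
    cost = torch.stack([torch.stack([F.mse_loss(h_s, h_t) for h_s in student_hidden])
                        for h_t in teacher_hidden])                      # [T, S]

    # The EMD linear program: min <flow, cost> with flow >= 0, rows summing to
    # teacher_w and columns summing to student_w.
    c = cost.detach().cpu().numpy().flatten()
    A_eq, b_eq = [], []
    for i in range(T):                 # teacher layer i ships weight teacher_w[i]
        row = np.zeros(T * S); row[i * S:(i + 1) * S] = 1.0
        A_eq.append(row); b_eq.append(teacher_w[i])
    for j in range(S):                 # student layer j receives weight student_w[j]
        col = np.zeros(T * S); col[j::S] = 1.0
        A_eq.append(col); b_eq.append(student_w[j])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    flow = torch.tensor(res.x.reshape(T, S), dtype=cost.dtype)

    # The plan is treated as a constant; the weighted cost remains differentiable
    # with respect to the student, so it can be minimised by ordinary backprop.
    return (flow * cost).sum() / flow.sum(), flow


def cost_attention_weights(flow, cost, temperature=1.0):
    """Hypothetical cost-attention update: teacher layers whose transfers were
    cheap receive larger weights for the next EMD round. This is one plausible
    reading of the abstract's cost attention mechanism, not the paper's rule."""
    avg_cost = (flow * cost).sum(dim=1) / flow.sum(dim=1).clamp_min(1e-9)
    return torch.softmax(-avg_cost / temperature, dim=0).detach().cpu().numpy()
```

In training, the plan would be re-solved periodically as the student improves, and the weights returned by cost_attention_weights would replace uniform layer weights in the next round; both choices are illustrative rather than prescribed by the abstract.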
Related papers
- Keep Decoding Parallel with Effective Knowledge Distillation from
Language Models to End-to-end Speech Recognisers [19.812986973537143]
This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers.
Our method shows that language model (LM) information can be more effectively distilled into an ASR model using both the intermediate layers and the final layer.
Using our method, we achieve better recognition accuracy than with shallow fusion of an external LM, allowing us to maintain fast parallel decoding.
arXiv Detail & Related papers (2024-01-22T05:46:11Z) - Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation [29.952771954087602]
Temporal Sentence Grounding in Videos (TSGV) aims to detect the event timestamps described by the natural language query from untrimmed videos.
This paper discusses the challenge of achieving efficient computation in TSGV models while maintaining high performance.
arXiv Detail & Related papers (2023-08-07T17:07:48Z) - Masked Image Modeling with Local Multi-Scale Reconstruction [54.91442074100597]
Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning.
Existing MIM models conduct the reconstruction task only at the top layer of the encoder.
We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively.
arXiv Detail & Related papers (2023-03-09T13:42:04Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - One Teacher is Enough? Pre-trained Language Model Distillation from
Multiple Teachers [54.146208195806636]
We propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression.
We show that MT-BERT can train a high-quality student model from multiple teacher PLMs.
Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
arXiv Detail & Related papers (2021-06-02T08:42:33Z) - Energy-Efficient and Federated Meta-Learning via Projected Stochastic
Gradient Ascent [79.58680275615752]
We propose an energy-efficient federated meta-learning framework.
We assume each task is owned by a separate agent, so a limited number of tasks is used to train a meta-model.
arXiv Detail & Related papers (2021-05-31T08:15:44Z) - Knowledge Distillation By Sparse Representation Matching [107.87219371697063]
We propose Sparse Representation Matching (SRM) to transfer intermediate knowledge from one convolutional neural network (CNN) to another by utilizing sparse representation.
We formulate SRM as a neural processing block, which can be efficiently optimized using gradient descent and integrated into any CNN in a plug-and-play manner.
Our experiments demonstrate that SRM is robust to architectural differences between the teacher and student networks, and outperforms other KD techniques across several datasets.
arXiv Detail & Related papers (2021-03-31T11:47:47Z) - Cross-Layer Distillation with Semantic Calibration [26.59016826651437]
We propose Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD), which automatically assigns proper target layers of the teacher model to each student layer.
With a learned attention distribution, each student layer distills knowledge contained in multiple teacher layers rather than a single fixed intermediate layer, providing appropriate cross-layer supervision during training (a minimal sketch of this attention-based matching appears after this list).
arXiv Detail & Related papers (2020-12-06T11:16:07Z) - Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, truncated max-product belief propagation, and add what is necessary to make it a proper component of a deep learning model.
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs).
The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
arXiv Detail & Related papers (2020-03-13T13:11:35Z)
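
The SemCKD entry above describes each student layer attending over all teacher layers, the closest relative of BERT-EMD's many-to-many mapping in this list. The following is a minimal sketch of such attention-based cross-layer matching; the mean-pooled layer summaries, projection size, and all class and parameter names are illustrative assumptions rather than the SemCKD authors' exact formulation.

```python
# Sketch of a learned attention distribution over teacher layers for each
# student layer, in the spirit of SemCKD. Names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossLayerAttention(nn.Module):
    def __init__(self, student_dim, teacher_dim, proj_dim=128):
        super().__init__()
        self.query = nn.Linear(student_dim, proj_dim)      # student-side projection
        self.key = nn.Linear(teacher_dim, proj_dim)        # teacher-side projection
        self.align = nn.Linear(student_dim, teacher_dim)   # match feature sizes for the loss

    def forward(self, student_hidden, teacher_hidden):
        """student_hidden: list of [batch, seq, student_dim] tensors.
        teacher_hidden: list of [batch, seq, teacher_dim] tensors.
        Returns the attention-weighted cross-layer distillation loss."""
        # Mean-pool each layer over the sequence to get one summary vector per layer.
        s_pool = torch.stack([h.mean(dim=1) for h in student_hidden], dim=1)  # [B, S, Ds]
        t_pool = torch.stack([h.mean(dim=1) for h in teacher_hidden], dim=1)  # [B, T, Dt]

        # Attention of every student layer over all teacher layers.
        attn = torch.softmax(
            self.query(s_pool) @ self.key(t_pool).transpose(1, 2), dim=-1)    # [B, S, T]

        # Pairwise MSE between aligned student layer j and teacher layer i,
        # weighted by the learned attention.
        loss = 0.0
        for j, h_s in enumerate(student_hidden):
            h_s = self.align(h_s)
            for i, h_t in enumerate(teacher_hidden):
                pair = F.mse_loss(h_s, h_t, reduction="none").mean(dim=(1, 2))  # [B]
                loss = loss + (attn[:, j, i] * pair).mean()
        return loss
```

For instance, with a 4-layer student and a 12-layer teacher, one would pass the two lists of hidden states to this module and add the returned loss, scaled by a tuning coefficient, to the task loss.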
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.