Improving Task-Agnostic BERT Distillation with Layer Mapping Search
- URL: http://arxiv.org/abs/2012.06153v1
- Date: Fri, 11 Dec 2020 06:29:58 GMT
- Title: Improving Task-Agnostic BERT Distillation with Layer Mapping Search
- Authors: Xiaoqi Jiao, Huating Chang, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao
Chen, Linlin Li, Fang Wang and Qun Liu
- Abstract summary: Recent work shows that layer-level supervision is crucial to the performance of the student BERT model.
In this paper, we propose to use the genetic algorithm (GA) to search for the optimal layer mapping automatically.
After obtaining the optimal layer mapping, we perform the task-agnostic BERT distillation with it on the whole corpus to build a compact student model.
- Score: 43.7650740369353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD), which transfers the knowledge from a large
teacher model to a small student model, has been widely used to compress the
BERT model recently. Besides the output-level supervision used in the original KD,
recent works show that layer-level supervision is crucial to the performance of
the student BERT model. However, previous works designed the layer mapping
strategy heuristically (e.g., uniform or last-layer), which can lead to
inferior performance. In this paper, we propose to use the genetic algorithm
(GA) to search for the optimal layer mapping automatically. To accelerate the
search process, we further propose a proxy setting where a small portion of the
training corpus is sampled for distillation, and three representative tasks
are chosen for evaluation. After obtaining the optimal layer mapping, we
perform the task-agnostic BERT distillation with it on the whole corpus to
build a compact student model, which can be directly fine-tuned on downstream
tasks. Comprehensive experiments on the evaluation benchmarks demonstrate that
1) the layer mapping strategy has a significant effect on task-agnostic BERT
distillation, and different layer mappings can result in quite different
performance; 2) the optimal layer mapping strategy from the proposed search
process consistently outperforms the other heuristic ones; 3) with the optimal
layer mapping, our student model achieves state-of-the-art performance on the
GLUE tasks.
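The abstract does not spell out the search procedure in code; the following is a minimal sketch of a genetic-algorithm search over layer mappings, assuming a 12-layer teacher and a 6-layer student. The function `proxy_distill_and_eval` is a hypothetical stand-in for the proxy setting described above (distilling on a sampled subset of the corpus and evaluating on three representative tasks); here it is replaced by a dummy score so the sketch runs on its own.

```python
import random

TEACHER_LAYERS = 12   # e.g., a BERT-base teacher
STUDENT_LAYERS = 6    # a compact student

def random_mapping():
    """One candidate: for each student layer, pick a teacher layer to mimic."""
    return tuple(sorted(random.randrange(1, TEACHER_LAYERS + 1)
                        for _ in range(STUDENT_LAYERS)))

def proxy_distill_and_eval(mapping):
    """Hypothetical fitness function. In the paper's proxy setting this would
    distill on a small corpus sample and evaluate on a few downstream tasks;
    here a dummy score (distance from the uniform mapping) keeps it runnable."""
    return -sum(abs(t - 2 * (i + 1)) for i, t in enumerate(mapping))

def crossover(a, b):
    cut = random.randrange(1, STUDENT_LAYERS)
    return tuple(sorted(a[:cut] + b[cut:]))

def mutate(mapping, rate=0.2):
    return tuple(sorted(random.randrange(1, TEACHER_LAYERS + 1)
                        if random.random() < rate else t for t in mapping))

def ga_search(pop_size=20, generations=10):
    population = [random_mapping() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=proxy_distill_and_eval, reverse=True)
        parents = ranked[: pop_size // 2]          # truncation selection
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=proxy_distill_and_eval)

print(ga_search())  # best mapping found under the dummy fitness
```

Under the dummy fitness this converges toward the uniform mapping; in the actual method each candidate's fitness would come from distillation and evaluation in the proxy setting, and the winning mapping is then used for full-corpus, task-agnostic distillation.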
Related papers
- Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification [7.005068872406135]
Recent advancements in automatic speaker verification (ASV) studies have been achieved by leveraging large-scale pretrained networks.
We present a novel approach for exploiting the multilayered nature of pretrained models for ASV.
We show how the proposed interlayer processing aids in maximizing the advantage of utilizing pretrained models.
arXiv Detail & Related papers (2024-09-12T05:55:32Z)
- Noisy Node Classification by Bi-level Optimization based Multi-teacher Distillation [17.50773984154023]
We propose a new multi-teacher distillation method based on bi-level optimization (namely BO-NNC) to conduct noisy node classification on the graph data.
Specifically, we first employ multiple self-supervised learning methods to train diverse teacher models, and then aggregate their predictions through a teacher weight matrix.
Furthermore, we design a new bi-level optimization strategy to dynamically adjust the teacher weight matrix based on the training progress of the student model.
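As a rough illustration of combining several teachers through a weight matrix (not the authors' BO-NNC code; the shapes and the per-node weighting scheme are assumptions), a sketch might look like this:

```python
import torch

# Hypothetical sizes: 3 teachers, 100 graph nodes, 5 classes.
num_teachers, num_nodes, num_classes = 3, 100, 5
teacher_logits = torch.randn(num_teachers, num_nodes, num_classes)  # stand-in teacher outputs

# A learnable per-node teacher weight matrix (initialized uniformly here).
teacher_weights = torch.nn.Parameter(torch.zeros(num_nodes, num_teachers))

w = torch.softmax(teacher_weights, dim=-1)                   # normalize weights per node
aggregated = torch.einsum('nt,tnc->nc', w, teacher_logits)   # weighted combination
soft_targets = torch.softmax(aggregated, dim=-1)             # soft labels for the student

# In the paper the weight matrix itself is updated by a bi-level optimization
# loop driven by the student's training progress (not shown here).
print(soft_targets.shape)  # torch.Size([100, 5])
```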
arXiv Detail & Related papers (2024-04-27T12:19:08Z)
- SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning [14.480769476843886]
We introduce SKILL, a novel method that conducts distillation across groups of layers instead of distilling individual arbitrarily selected layers within the teacher network.
Extensive experiments demonstrate that our distilled version of WavLM Base+ not only outperforms DPHuBERT but also achieves state-of-the-art results in the 30M parameters model class.
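A minimal sketch of group-wise layer distillation in this spirit (not the SKILL implementation): a 12-layer teacher is partitioned into six groups, here simply consecutive pairs, whereas SKILL derives the groups from layer similarity, and each student layer matches the average representation of its group.

```python
import torch

hidden = 768
teacher_states = [torch.randn(2, 16, hidden) for _ in range(12)]  # (batch, seq, hidden) per layer
student_states = [torch.randn(2, 16, hidden) for _ in range(6)]
groups = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9), (10, 11)]       # illustrative grouping

# Each student layer is pulled toward the mean hidden state of its teacher group.
loss = sum(
    torch.nn.functional.mse_loss(
        student_states[i],
        torch.stack([teacher_states[j] for j in grp]).mean(dim=0),
    )
    for i, grp in enumerate(groups)
)
print(loss.item())
```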
arXiv Detail & Related papers (2024-02-26T18:56:42Z)
- Effective Whole-body Pose Estimation with Two-stages Distillation [52.92064408970796]
Whole-body pose estimation localizes the human body, hand, face, and foot keypoints in an image.
We present a two-stage pose Distillation for Whole-body Pose estimators, named DWPose, to improve their effectiveness and efficiency.
arXiv Detail & Related papers (2023-07-29T03:49:28Z)
- Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond [52.656743602538825]
Fine-tuning pre-trained code models incurs a large computational cost.
We conduct an experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning.
We propose Telly to efficiently fine-tune pre-trained code models via layer freezing.
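A minimal sketch of fine-tuning with layer freezing in the spirit of this study (not the authors' Telly code); the checkpoint name and the choice to freeze the embeddings plus the first eight encoder layers are illustrative assumptions.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")  # stand-in pre-trained model

# Freeze the embeddings and the lower encoder layers; only the upper layers
# (and any task head added later) receive gradient updates.
for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```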
arXiv Detail & Related papers (2023-04-11T13:34:13Z)
- Active Teacher for Semi-Supervised Object Detection [80.10937030195228]
We propose a novel algorithm called Active Teacher for semi-supervised object detection (SSOD).
Active Teacher extends the teacher-student framework to an iterative version, where the label set is partially and gradually augmented by evaluating three key factors of unlabeled examples.
With this design, Active Teacher can maximize the effect of limited label information while improving the quality of pseudo-labels.
arXiv Detail & Related papers (2023-03-15T03:59:27Z)
- RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation [24.951887361152988]
We propose a RAndom Intermediate Layer Knowledge Distillation (RAIL-KD) approach in which intermediate layers from the teacher model are selected randomly to be distilled into the intermediate layers of the student model.
We show that our proposed RAIL-KD approach outperforms other state-of-the-art intermediate layer KD methods considerably in both performance and training-time.
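A minimal sketch of random intermediate-layer mapping in this spirit (not the RAIL-KD implementation); the layer counts, hidden size, and MSE objective are assumptions.

```python
import random
import torch

TEACHER_LAYERS, STUDENT_LAYERS, HIDDEN = 12, 4, 768

def sample_layer_mapping():
    """Re-sampled each epoch: pick STUDENT_LAYERS teacher layers at random and
    pair them, in order, with the student's intermediate layers."""
    chosen = sorted(random.sample(range(TEACHER_LAYERS), STUDENT_LAYERS))
    return list(enumerate(chosen))  # [(student_idx, teacher_idx), ...]

def intermediate_loss(student_states, teacher_states, mapping):
    return sum(torch.nn.functional.mse_loss(student_states[s], teacher_states[t])
               for s, t in mapping)

# Dummy hidden states standing in for real model outputs.
t_states = [torch.randn(2, 16, HIDDEN) for _ in range(TEACHER_LAYERS)]
s_states = [torch.randn(2, 16, HIDDEN) for _ in range(STUDENT_LAYERS)]
print(intermediate_loss(s_states, t_states, sample_layer_mapping()))
```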
arXiv Detail & Related papers (2021-09-21T13:21:13Z)
- Follow Your Path: a Progressive Method for Knowledge Distillation [23.709919521355936]
We propose ProKT, a new model-agnostic method that projects the supervision signals of a teacher model into the student's parameter space.
Experiments on both image and text datasets show that our proposed ProKT consistently achieves superior performance compared to other existing knowledge distillation methods.
arXiv Detail & Related papers (2021-07-20T07:44:33Z)
- MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation [153.56211546576978]
In this work, we propose that better soft targets with higher compatibility can be generated by using a label generator.
We can employ the meta-learning technique to optimize this label generator.
The experiments are conducted on two standard classification benchmarks, namely CIFAR-100 and ILSVRC2012.
arXiv Detail & Related papers (2020-08-27T13:04:27Z)
- Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
arXiv Detail & Related papers (2020-07-21T13:27:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.