HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained
Transformers
- URL: http://arxiv.org/abs/2302.09632v1
- Date: Sun, 19 Feb 2023 17:37:24 GMT
- Title: HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained
Transformers
- Authors: Chen Liang, Haoming Jiang, Zheng Li, Xianfeng Tang, Bing Yin and Tuo
Zhao
- Abstract summary: This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
- Score: 49.79405257763856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation has been shown to be a powerful model compression
approach to facilitate the deployment of pre-trained language models in
practice. This paper focuses on task-agnostic distillation. It produces a
compact pre-trained model that can be easily fine-tuned on various tasks with
small computational costs and memory footprints. Despite the practical
benefits, task-agnostic distillation is challenging. Since the teacher model
has a significantly larger capacity and stronger representation power than the
student model, it is very difficult for the student to produce predictions that
match the teacher's over a massive amount of open-domain training data. Such a
large prediction discrepancy often diminishes the benefits of knowledge
distillation. To address this challenge, we propose Homotopic Distillation
(HomoDistil), a novel task-agnostic distillation approach equipped with
iterative pruning. Specifically, we initialize the student model from the
teacher model, and iteratively prune the student's neurons until the target
width is reached. Such an approach maintains a small discrepancy between the
teacher's and student's predictions throughout the distillation process, which
ensures the effectiveness of knowledge transfer. Extensive experiments
demonstrate that HomoDistil achieves significant improvements on existing
baselines.
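
A minimal sketch of the idea, using a toy two-layer PyTorch encoder in place of a Transformer: the student starts as an exact copy of the teacher, and each update masks out a few more hidden neurons while the student is trained to match the teacher's outputs. The magnitude-based importance score, the MSE matching loss, and all sizes and hyperparameters below are illustrative assumptions, not the paper's actual recipe.

# Homotopic prune-while-distilling loop (illustrative sketch, not the
# authors' implementation).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    def __init__(self, d_in=64, d_hidden=256, d_out=64):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)
        # Binary mask over hidden neurons; pruning zeroes entries of this mask.
        self.register_buffer("neuron_mask", torch.ones(d_hidden))

    def forward(self, x):
        h = F.relu(self.fc1(x)) * self.neuron_mask
        return self.fc2(h)

def prune_step(student, n_prune):
    # Zero out the n_prune least important neurons that are still active.
    # Importance = L2 norm of each neuron's outgoing weights (an assumption;
    # the paper's importance criterion may differ).
    with torch.no_grad():
        importance = student.fc2.weight.norm(dim=0) * student.neuron_mask
        importance[student.neuron_mask == 0] = float("inf")
        drop = torch.topk(importance, n_prune, largest=False).indices
        student.neuron_mask[drop] = 0.0

teacher = ToyEncoder().eval()
student = copy.deepcopy(teacher)     # key idea: the student starts as the teacher
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

steps, prune_per_step = 96, 2        # 256 -> 64 hidden neurons over 96 steps

for step in range(steps):
    x = torch.randn(32, 64)          # stand-in for open-domain pre-training data
    with torch.no_grad():
        t_out = teacher(x)
    s_out = student(x)
    # Distillation loss: match teacher outputs. The discrepancy stays small
    # because only two neurons are removed between consecutive updates.
    loss = F.mse_loss(s_out, t_out)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    prune_step(student, prune_per_step)

In a real setting the same schedule would prune rows and columns of Transformer weight matrices and combine prediction- and representation-level distillation losses; the sketch only shows the homotopy-style schedule in which pruning and distillation interleave so the student never drifts far from the teacher.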
Related papers
- Progressive distillation induces an implicit curriculum [44.528775476168654]
A better teacher does not always yield a better student; a common mitigation is to use additional supervision from several teachers.
One empirically validated variant of this principle is progressive distillation, where the student learns from successive intermediate checkpoints of the teacher.
Using sparse parity as a sandbox, we identify an implicit curriculum as one mechanism through which progressive distillation accelerates the student's learning.
arXiv Detail & Related papers (2024-10-07T19:49:24Z)
- Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation [25.58020699235669]
Knowledge distillation, transferring knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation.
In this study, we argue that token-level distillation, with its more complex objective (i.e., distribution), is better suited for "simple" scenarios.
We introduce a novel hybrid method that combines token-level and sentence-level distillation through a gating mechanism (a toy sketch of such a gated loss appears after this list).
arXiv Detail & Related papers (2024-04-23T08:29:56Z)
- Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion [29.297959023968165]
This paper proposes a progressive distillation method based on masked generation features for the KGC task.
Specifically, we perform pre-distillation on the PLM to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models.
The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-01-19T07:34:36Z)
- Can a student Large Language Model perform as well as it's teacher? [0.0]
Knowledge distillation aims to transfer knowledge from a high-capacity "teacher" model to a streamlined "student" model.
This paper provides a comprehensive overview of the knowledge distillation paradigm.
arXiv Detail & Related papers (2023-10-03T20:34:59Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Teacher's pet: understanding and mitigating biases in distillation [61.44867470297283]
Several works have shown that distillation significantly boosts the student's overall performance.
However, are these gains uniform across all data subgroups?
We show that distillation can harm performance on certain subgroups.
We present techniques which soften the teacher influence for subgroups where it is less reliable.
arXiv Detail & Related papers (2021-06-19T13:06:25Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout the distillation process, and most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
- Pre-trained Summarization Distillation [121.14806854092672]
Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.
arXiv Detail & Related papers (2020-10-24T23:15:43Z)
- Why distillation helps: a statistical perspective [69.90148901064747]
Knowledge distillation is a technique for improving the performance of a simple "student" model.
While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help?
We show how distillation complements existing negative mining techniques for extreme multiclass retrieval.
arXiv Detail & Related papers (2020-05-21T01:49:51Z)
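
For the gating mechanism mentioned in the Sentence-Level or Token-Level entry above, the sketch below shows one way a learned gate could mix a token-level KL loss with a sentence-level pseudo-label loss. The gate network, the way the sentence-level targets are produced, and all shapes are assumptions for illustration, not details taken from that paper.

# Gated mix of token-level and sentence-level distillation losses
# (illustrative sketch; not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def token_level_kd(student_logits, teacher_logits, T=2.0):
    # KL between teacher and student per-token distributions.
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * T * T

def sentence_level_kd(student_logits, teacher_tokens):
    # Cross-entropy against the teacher's decoded sequence (pseudo-labels).
    return F.cross_entropy(student_logits.transpose(1, 2), teacher_tokens)

class GatedKDLoss(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)   # predicts a mixing weight per batch

    def forward(self, enc_summary, student_logits, teacher_logits, teacher_tokens):
        g = torch.sigmoid(self.gate(enc_summary)).mean()   # scalar gate in (0, 1)
        return g * token_level_kd(student_logits, teacher_logits) + \
               (1.0 - g) * sentence_level_kd(student_logits, teacher_tokens)

# Toy usage: batch of 4, sequence length 16, vocabulary of 100, d_model of 32.
loss_fn = GatedKDLoss(d_model=32)
loss = loss_fn(torch.randn(4, 32),
               torch.randn(4, 16, 100),
               torch.randn(4, 16, 100),
               torch.randint(0, 100, (4, 16)))
loss.backward()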
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the listed information and is not responsible for any consequences arising from its use.