Distiller: A Systematic Study of Model Distillation Methods in Natural
Language Processing
- URL: http://arxiv.org/abs/2109.11105v1
- Date: Thu, 23 Sep 2021 02:12:28 GMT
- Title: Distiller: A Systematic Study of Model Distillation Methods in Natural
Language Processing
- Authors: Haoyu He, Xingjian Shi, Jonas Mueller, Sheng Zha, Mu Li, George
Karypis
- Abstract summary: We aim to identify how different components in the knowledge distillation (KD) pipeline affect the resulting performance.
We propose Distiller, a meta KD framework that combines a broad range of techniques across different stages of the KD pipeline.
We find that different datasets/tasks prefer different KD algorithms, and thus propose a simple AutoDistiller algorithm.
- Score: 21.215122347801696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We aim to identify how different components in the KD pipeline,
such as the data augmentation policy, the loss function, and the intermediate
representation used to transfer knowledge between teacher and student, affect
the resulting performance, and how much the optimal KD pipeline varies across
different datasets/tasks. To tease apart their effects, we propose
Distiller, a meta KD framework that systematically combines a broad range of
techniques across different stages of the KD pipeline, which enables us to
quantify each component's contribution. Within Distiller, we unify commonly
used objectives for distillation of intermediate representations under a
universal mutual information (MI) objective and propose a class of MI-$\alpha$
objective functions with better bias/variance trade-off for estimating the MI
between the teacher and the student. On a diverse set of NLP datasets, the best
Distiller configurations are identified via large-scale hyperparameter
optimization. Our experiments reveal the following: 1) the approach used to
distill the intermediate representations is the most important factor in KD
performance, 2) among different objectives for intermediate distillation,
MI-$\alpha$ performs the best, and 3) data augmentation provides a large boost
for small training datasets or small student networks. Moreover, we find that
different datasets/tasks prefer different KD algorithms, and thus propose a
simple AutoDistiller algorithm that can recommend a good KD pipeline for a new
dataset.
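As a rough illustration of the kind of intermediate-representation objective described above, the following Python sketch computes an InfoNCE-style lower bound on the mutual information between pooled teacher and student hidden states. All names (the function, the projector, the temperature) are hypothetical; the MI-alpha family proposed in the paper generalizes this sort of contrastive estimator with a tunable bias/variance trade-off rather than taking exactly this form.

import torch
import torch.nn.functional as F

def infonce_intermediate_loss(student_hidden, teacher_hidden, projector, temperature=0.1):
    # Illustrative sketch only, not the paper's MI-alpha objective.
    # Project pooled student states into the teacher's feature space and
    # L2-normalize both sides.
    s = F.normalize(projector(student_hidden), dim=-1)   # (batch, d_teacher)
    t = F.normalize(teacher_hidden, dim=-1)              # (batch, d_teacher)
    # Similarity of every student example to every teacher example; matching
    # pairs sit on the diagonal, other in-batch pairs act as negatives.
    logits = s @ t.T / temperature                       # (batch, batch)
    targets = torch.arange(s.size(0), device=s.device)
    # Minimizing this cross-entropy maximizes the InfoNCE bound
    # log(batch_size) - loss <= I(student; teacher).
    return F.cross_entropy(logits, targets)

# Toy usage: random tensors stand in for real model activations, with a
# hypothetical linear projector mapping a 256-d student space to a 768-d
# teacher space.
projector = torch.nn.Linear(256, 768)
loss = infonce_intermediate_loss(torch.randn(32, 256), torch.randn(32, 768), projector)

In a full KD pipeline this term would be weighted and combined with the usual task and logit-distillation losses; which intermediate objective and weight to use are exactly the kinds of pipeline choices the paper's experiments and AutoDistiller aim to identify.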
Related papers
- Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation [69.60321475454843]
We propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation.
In the pre-training stage, we propose a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales.
Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module.
arXiv Detail & Related papers (2024-08-21T06:48:38Z)
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
- Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs).
We re-formulate KD of LLMs into two stages, the first optimizing an objective consisting of implicit reward and reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z)
- Contextual Distillation Model for Diversified Recommendation [19.136439564988834]
Contextual Distillation Model (CDM) is an efficient recommendation model that addresses diversification.
We propose a contrastive context encoder that employs attention mechanisms to model both positive and negative contexts.
During inference, ranking is performed through a linear combination of the recommendation and student model scores.
arXiv Detail & Related papers (2024-06-13T11:55:40Z)
- CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning [101.81127587760831]
Current fine-tuning methods build adapters widely agnostic of the context of the downstream task to learn, or the context of important knowledge to maintain.
We propose CorDA, a Context-oriented Decomposition Adaptation method that builds learnable task-aware adapters.
Our method enables two options, the knowledge-preserved adaptation and the instruction-previewed adaptation.
arXiv Detail & Related papers (2024-06-07T19:10:35Z)
- AICSD: Adaptive Inter-Class Similarity Distillation for Semantic Segmentation [12.92102548320001]
This paper proposes a novel method called Inter-Class Similarity Distillation (ICSD) for the purpose of knowledge distillation.
The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs.
Experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-08-08T13:17:20Z)
- Class Anchor Margin Loss for Content-Based Image Retrieval [97.81742911657497]
We propose a novel repeller-attractor loss that falls in the metric learning paradigm, yet directly optimizes the L2 metric without the need to generate pairs.
We evaluate the proposed objective in the context of few-shot and full-set training on the CBIR task, by using both convolutional and transformer architectures.
arXiv Detail & Related papers (2023-06-01T12:53:10Z)
- Improving Knowledge Distillation via Regularizing Feature Norm and Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features (a generic sketch of these standard objectives is given after this list).
While it is natural to believe that better alignment of student features to the teacher better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance.
arXiv Detail & Related papers (2023-05-26T15:05:19Z)
- Prediction-Guided Distillation for Dense Object Detection [7.5320132424481505]
We show that only a very small fraction of features within a ground-truth bounding box are responsible for a teacher's high detection performance.
We propose Prediction-Guided Distillation (PGD), which focuses distillation on these key predictive regions of the teacher.
Our proposed approach outperforms current state-of-the-art KD baselines on a variety of advanced one-stage detection architectures.
arXiv Detail & Related papers (2022-03-10T16:46:05Z)
- EvDistill: Asynchronous Events to End-task Learning via Bidirectional Reconstruction-guided Cross-modal Knowledge Distillation [61.33010904301476]
Event cameras sense per-pixel intensity changes and produce asynchronous event streams with high dynamic range and less motion blur.
We propose a novel approach, called EvDistill, to learn a student network on the unlabeled and unpaired event data.
We show that EvDistill achieves significantly better results than the prior works and KD with only events and APS frames.
arXiv Detail & Related papers (2021-11-24T08:48:16Z)
- Modality-specific Distillation [30.190082262375395]
We propose modality-specific distillation (MSD) to effectively transfer knowledge from a teacher on multimodal datasets.
Our idea aims at mimicking a teacher's modality-specific predictions by introducing an auxiliary loss term for each modality.
Because each modality has different importance for predictions, we also propose weighting approaches for the auxiliary losses.
arXiv Detail & Related papers (2021-01-06T05:45:07Z)
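Referring back to the entry on "Improving Knowledge Distillation via Regularizing Feature Norm and Direction" above, the standard alignment objectives it mentions (KL divergence between logits and L2 distance between intermediate features) can be written as the short Python sketch below. This is a generic baseline, not that paper's regularization method; the loss weights and the assumption that feature shapes already match are placeholders.

import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                    temperature=4.0, alpha=1.0, beta=1.0):
    # KL divergence between temperature-softened teacher and student logits,
    # scaled by T^2 (the usual convention) so gradient magnitudes stay
    # comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # L2 distance between intermediate features; in practice a learned
    # projection usually aligns mismatched dimensions first.
    l2 = F.mse_loss(student_feat, teacher_feat)
    return alpha * kl + beta * l2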