Distiller: A Systematic Study of Model Distillation Methods in Natural
Language Processing
- URL: http://arxiv.org/abs/2109.11105v1
- Date: Thu, 23 Sep 2021 02:12:28 GMT
- Title: Distiller: A Systematic Study of Model Distillation Methods in Natural
Language Processing
- Authors: Haoyu He, Xingjian Shi, Jonas Mueller, Sheng Zha, Mu Li, George
Karypis
- Abstract summary: We aim to identify how different components in the knowledge distillation (KD) pipeline affect the resulting performance.
We propose Distiller, a meta KD framework that combines a broad range of techniques across different stages of the KD pipeline.
We find that different datasets/tasks prefer different KD algorithms, and thus propose a simple AutoDistiller algorithm.
- Score: 21.215122347801696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We aim to identify how different components in the KD pipeline,
such as the data augmentation policy, the loss function, and the intermediate
representation used to transfer knowledge between teacher and student, affect
the resulting performance, and how much the optimal KD pipeline varies across
different datasets/tasks. To tease apart their effects, we propose
Distiller, a meta KD framework that systematically combines a broad range of
techniques across different stages of the KD pipeline, which enables us to
quantify each component's contribution. Within Distiller, we unify commonly
used objectives for distillation of intermediate representations under a
universal mutual information (MI) objective and propose a class of MI-$\alpha$
objective functions with better bias/variance trade-off for estimating the MI
between the teacher and the student. On a diverse set of NLP datasets, the best
Distiller configurations are identified via large-scale hyperparameter
optimization. Our experiments reveal the following: 1) the approach used to
distill the intermediate representations is the most important factor in KD
performance, 2) among different objectives for intermediate distillation,
MI-$\alpha$ performs the best, and 3) data augmentation provides a large boost
for small training datasets or small student networks. Moreover, we find that
different datasets/tasks prefer different KD algorithms, and thus propose a
simple AutoDistiller algorithm that can recommend a good KD pipeline for a new
dataset.
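As a rough illustration of the kind of intermediate-representation objective described above, the following Python sketch computes an InfoNCE-style lower bound on the mutual information between pooled teacher and student hidden states. All names (the function, the projector, the temperature) are hypothetical; the MI-alpha family proposed in the paper generalizes this sort of contrastive estimator with a tunable bias/variance trade-off rather than taking exactly this form.

import torch
import torch.nn.functional as F

def infonce_intermediate_loss(student_hidden, teacher_hidden, projector, temperature=0.1):
    # Illustrative sketch only, not the paper's MI-alpha objective.
    # Project pooled student states into the teacher's feature space and
    # L2-normalize both sides.
    s = F.normalize(projector(student_hidden), dim=-1)   # (batch, d_teacher)
    t = F.normalize(teacher_hidden, dim=-1)              # (batch, d_teacher)
    # Similarity of every student example to every teacher example; matching
    # pairs sit on the diagonal, other in-batch pairs act as negatives.
    logits = s @ t.T / temperature                       # (batch, batch)
    targets = torch.arange(s.size(0), device=s.device)
    # Minimizing this cross-entropy maximizes the InfoNCE bound
    # log(batch_size) - loss <= I(student; teacher).
    return F.cross_entropy(logits, targets)

# Toy usage: random tensors stand in for real model activations, with a
# hypothetical linear projector mapping a 256-d student space to a 768-d
# teacher space.
projector = torch.nn.Linear(256, 768)
loss = infonce_intermediate_loss(torch.randn(32, 256), torch.randn(32, 768), projector)

In a full KD pipeline this term would be weighted and combined with the usual task and logit-distillation losses; which intermediate objective and weight to use are exactly the kinds of pipeline choices the paper's experiments and AutoDistiller aim to identify.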
Related papers
- Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation [69.60321475454843]
We propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation.
In the pre-training stage, we propose a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales.
Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module.
arXiv Detail & Related papers (2024-08-21T06:48:38Z)
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
- Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs).
We re-formulate KD of LLMs into two stages, the first optimizing an objective consisting of implicit reward and reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z)
- Contextual Distillation Model for Diversified Recommendation [19.136439564988834]
Contextual Distillation Model (CDM) is an efficient recommendation model that addresses diversification.
We propose a contrastive context encoder that employs attention mechanisms to model both positive and negative contexts.
During inference, ranking is performed through a linear combination of the recommendation and student model scores.
arXiv Detail & Related papers (2024-06-13T11:55:40Z)
- CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning [101.81127587760831]
Current fine-tuning methods build adapters widely agnostic of the context of the downstream task to learn, or the context of important knowledge to maintain.
We propose CorDA, a Context-oriented Decomposition Adaptation method that builds learnable task-aware adapters.
Our method enables two options, the knowledge-preserved adaptation and the instruction-previewed adaptation.
arXiv Detail & Related papers (2024-06-07T19:10:35Z)
- AICSD: Adaptive Inter-Class Similarity Distillation for Semantic Segmentation [12.92102548320001]
This paper proposes a novel method called Inter-Class Similarity Distillation (ICSD) for the purpose of knowledge distillation.
The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs.
Experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-08-08T13:17:20Z)
- Class Anchor Margin Loss for Content-Based Image Retrieval [97.81742911657497]
We propose a novel repeller-attractor loss that falls in the metric learning paradigm, yet directly optimizes the L2 metric without the need to generate pairs.
We evaluate the proposed objective in the context of few-shot and full-set training on the CBIR task, by using both convolutional and transformer architectures.
arXiv Detail & Related papers (2023-06-01T12:53:10Z)
- Improving Knowledge Distillation via Regularizing Feature Norm and Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features (a generic sketch of these standard objectives is given after this list).
While it is natural to believe that better alignment of student features to the teacher better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance.
arXiv Detail & Related papers (2023-05-26T15:05:19Z)
- Prediction-Guided Distillation for Dense Object Detection [7.5320132424481505]
We show that only a very small fraction of features within a ground-truth bounding box are responsible for a teacher's high detection performance.
We propose Prediction-Guided Distillation (PGD), which focuses distillation on these key predictive regions of the teacher.
Our proposed approach outperforms current state-of-the-art KD baselines on a variety of advanced one-stage detection architectures.
arXiv Detail & Related papers (2022-03-10T16:46:05Z)
- EvDistill: Asynchronous Events to End-task Learning via Bidirectional Reconstruction-guided Cross-modal Knowledge Distillation [61.33010904301476]
Event cameras sense per-pixel intensity changes and produce asynchronous event streams with high dynamic range and less motion blur.
We propose a novel approach, called EvDistill, to learn a student network on the unlabeled and unpaired event data.
We show that EvDistill achieves significantly better results than the prior works and KD with only events and APS frames.
arXiv Detail & Related papers (2021-11-24T08:48:16Z)
- Modality-specific Distillation [30.190082262375395]
We propose modality-specific distillation (MSD) to effectively transfer knowledge from a teacher on multimodal datasets.
Our idea aims at mimicking a teacher's modality-specific predictions by introducing an auxiliary loss term for each modality.
Because each modality has different importance for predictions, we also propose weighting approaches for the auxiliary losses.
arXiv Detail & Related papers (2021-01-06T05:45:07Z)
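Referring back to the entry on "Improving Knowledge Distillation via Regularizing Feature Norm and Direction" above, the standard alignment objectives it mentions (KL divergence between logits and L2 distance between intermediate features) can be written as the short Python sketch below. This is a generic baseline, not that paper's regularization method; the loss weights and the assumption that feature shapes already match are placeholders.

import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                    temperature=4.0, alpha=1.0, beta=1.0):
    # KL divergence between temperature-softened teacher and student logits,
    # scaled by T^2 (the usual convention) so gradient magnitudes stay
    # comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # L2 distance between intermediate features; in practice a learned
    # projection usually aligns mismatched dimensions first.
    l2 = F.mse_loss(student_feat, teacher_feat)
    return alpha * kl + beta * l2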