Related papers: Practical Insights into Knowledge Distillation for Pre-Trained Models

URL: http://arxiv.org/abs/2402.14922v2
Date: Tue, 22 Jul 2025 10:21:30 GMT
Title: Practical Insights into Knowledge Distillation for Pre-Trained Models
Authors: Norah Alballa, Ahmed M. Abdelmoniem, Marco Canini
Abstract summary: This research investigates the enhancement of knowledge distillation (KD) processes in pre-trained models. Despite the adoption of numerous KD approaches for transferring knowledge among pre-trained models, a comprehensive understanding of KD's application is lacking. Our study conducts an extensive comparison of multiple KD techniques, including standard KD, tuned KD (via optimized temperature and weight parameters), deep mutual learning, and data partitioning KD.
Score: 7.248285042377168
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This research investigates the enhancement of knowledge distillation (KD) processes in pre-trained models, an emerging field in knowledge transfer with significant implications for distributed training and federated learning environments. These environments benefit from reduced communication demands and accommodate various model architectures. Despite the adoption of numerous KD approaches for transferring knowledge among pre-trained models, a comprehensive understanding of KD's application in these scenarios is lacking. Our study conducts an extensive comparison of multiple KD techniques, including standard KD, tuned KD (via optimized temperature and weight parameters), deep mutual learning, and data partitioning KD. We assess these methods across various data distribution strategies to identify the most effective contexts for each. Through detailed examination of hyperparameter tuning, informed by extensive grid search evaluations, we pinpoint when adjustments are crucial to enhance model performance. This paper sheds light on optimal hyperparameter settings for distinct data partitioning scenarios and investigates KD's role in improving federated learning by minimizing communication rounds and expediting the training process. By filling a notable void in current research, our findings serve as a practical framework for leveraging KD in pre-trained models within collaborative and federated learning frameworks.
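
As an illustration of the "temperature and weight parameters" mentioned in the abstract, the following is a minimal sketch of the commonly used soft-label KD objective, not the authors' implementation. The function names (`softmax`, `kd_loss`), the default values, and the convention that `alpha` weights the hard-label term are all assumptions made for this example; the paper may parameterize the loss differently.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-label KL divergence.

    Assumed convention: alpha weights the hard-label term and (1 - alpha)
    weights the distillation term.
    """
    # Hard-label term: cross-entropy of the student (T = 1) against the true class.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(q_teacher, q_student) if t > 0)

    # The T^2 factor is the usual rescaling that keeps the soft-term gradient
    # magnitude comparable across temperatures.
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl
```

In this sketch, "tuned KD" would correspond to searching over `temperature` and `alpha` (for example with a grid search) rather than keeping the defaults.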

Related papers

An Empirical Study of Knowledge Distillation for Code Understanding Tasks [19.64130505527951]
Knowledge distillation (KD) addresses limitations by transferring knowledge from large teacher models to compact student models. This paper systematically investigates the effectiveness and usage of KD in code understanding tasks.
arXiv Detail & Related papers (2025-08-21T10:24:48Z)
KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning [72.53466291156604]
We present KDRL, a unified post-training framework that jointly optimizes a reasoning model through teacher supervision (KD) and self-exploration (RL). We first formulate a unified objective that integrates GRPO and KD, and systematically explore how different KL approximations, KL coefficients, and reward-guided KD strategies affect the overall post-training dynamics and performance.
arXiv Detail & Related papers (2025-06-02T19:46:41Z)
Active Data Curation Effectively Distills Large-Scale Multimodal Models [66.23057263509027]
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations.
arXiv Detail & Related papers (2024-11-27T18:50:15Z)
Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
Applications of Knowledge Distillation in Remote Sensing: A Survey [3.481234252899159]
Knowledge distillation (KD) is a technique developed to transfer knowledge from a complex, often cumbersome model (teacher) to a more compact and efficient model (student). The article provides a comprehensive taxonomy of KD techniques, where each category is critically analyzed to demonstrate the breadth and depth of the alternative options. The review discusses the challenges and limitations of KD in RS, including practical constraints and prospective future directions.
arXiv Detail & Related papers (2024-09-18T16:30:49Z)
Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs). We re-formulate KD of LLMs into two stages: first optimizing an objective consisting of implicit reward and reverse KL divergence. We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z)
A Survey on Knowledge Distillation of Large Language Models [99.11900233108487]
Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities to open-source models. This paper presents a comprehensive survey of KD's role within the realm of Large Language Models (LLMs).
arXiv Detail & Related papers (2024-02-20T16:17:37Z)
ShiftKD: Benchmarking Knowledge Distillation under Distribution Shift [7.256448072529497]
Knowledge Distillation (KD) transfers knowledge from large models to small models and has recently achieved remarkable success. However, the reliability of existing KD methods in real-world applications, especially under distribution shift, remains underexplored. We propose a unified and systematic framework, ShiftKD, to benchmark KD against two general distributional shifts.
arXiv Detail & Related papers (2023-12-25T10:43:31Z)
Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models. Most existing KD techniques rely on Kullback-Leibler (KL) divergence. We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference. We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples. CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability. We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the error and the empirical error. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
Knowledge Distillation Beyond Model Compression [13.041607703862724]
Knowledge distillation (KD) is commonly regarded as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or ensemble of models (teacher). In this study, we provide an extensive study of nine different KD methods, which cover a broad spectrum of approaches to capturing and transferring knowledge.
arXiv Detail & Related papers (2020-07-03T19:54:04Z)
Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model. The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)