KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models
- URL: http://arxiv.org/abs/2603.01875v1
- Date: Mon, 02 Mar 2026 13:54:19 GMT
- Title: KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models
- Authors: Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu,
- Abstract summary: Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. We present \textbf{KDFlow}, which features a decoupled architecture and employs SGLang for teacher inference. Our framework supports both off-policy and on-policy distillation and incorporates KD algorithms for cross-tokenizer KD through highly extensible and user-friendly APIs.
- Score: 45.26304817538935
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP and DeepSpeed) for both models, leading to suboptimal training efficiency. In this paper, we present a novel framework for LLM distillation, termed \textbf{KDFlow}, which features a decoupled architecture and employs SGLang for teacher inference. By bridging the training efficiency of FSDP2 and the inference efficiency of SGLang, KDFlow achieves full utilization of both advantages in a unified system. Moreover, instead of transferring full logits across different processes, our framework only transmits the teacher's hidden states using zero-copy data transfer and recomputes the logits on the student side, effectively balancing the communication cost and KD performance. Furthermore, our framework supports both off-policy and on-policy distillation and incorporates KD algorithms for cross-tokenizer KD through highly extensible and user-friendly APIs. Experiments show that KDFlow can achieve \textbf{1.44$\times$ to 6.36$\times$} speedup compared to current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: https://github.com/songmzhang/KDFlow
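The hidden-state transfer described in the abstract can be illustrated with a minimal NumPy sketch (hypothetical names and toy sizes; KDFlow's actual zero-copy transfer between SGLang and FSDP2 processes is not shown): instead of shipping vocabulary-sized logits per token across processes, only the much smaller hidden states cross the boundary, and the full logits are rebuilt on the student side with the teacher's LM head.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: 2 token positions, hidden size 4, vocabulary size 10.
# In practice the gap is far larger (e.g. hidden 4096 vs. a 100k+ vocabulary),
# which is why sending hidden states instead of logits cuts communication cost.
rng = np.random.default_rng(0)
teacher_hidden = rng.normal(size=(2, 4))   # the only tensor sent across processes
teacher_head = rng.normal(size=(4, 10))    # teacher's LM head, resident on the student side

# Recompute the full-vocabulary teacher logits locally on the student side.
teacher_logits = teacher_hidden @ teacher_head
teacher_probs = softmax(teacher_logits)

student_logits = rng.normal(size=(2, 10))
student_log_probs = np.log(softmax(student_logits))

# Token-level forward KL(teacher || student), a standard white-box KD loss.
kd_loss = np.mean(
    np.sum(teacher_probs * (np.log(teacher_probs) - student_log_probs), axis=-1)
)
```

The per-token payload shrinks from vocabulary size to hidden size, at the cost of one extra matrix multiply on the student side.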
Related papers
- An Empirical Study of Knowledge Distillation for Code Understanding Tasks [19.64130505527951]
Knowledge distillation (KD) addresses the limitations of large models by transferring knowledge from large teacher models to compact student models. This paper systematically investigates the effectiveness and usage of KD in code understanding tasks.
arXiv Detail & Related papers (2025-08-21T10:24:48Z) - A Dual-Space Framework for General Knowledge Distillation of Large Language Models [98.73585104789217]
Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. The current white-box KD framework exhibits two limitations. We propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD.
arXiv Detail & Related papers (2025-04-15T17:38:47Z) - Active Data Curation Effectively Distills Large-Scale Multimodal Models [66.23057263509027]
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. In this work we explore an alternative, yet simple approach: active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations.
arXiv Detail & Related papers (2024-11-27T18:50:15Z) - Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs). We re-formulate KD of LLMs into two stages: first optimizing an objective consisting of implicit reward and reverse KL divergence. We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z) - Dual-Space Knowledge Distillation for Large Language Models [39.798007795604676]
We propose a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of the two models for KD.
Our framework is not only compatible with various distance functions for KD like the current framework, but also supports KD between any two LLMs regardless of their vocabularies.
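A rough sketch of the dual-space idea (assumptions: a linear projector and toy sizes; the paper's learned projection may differ): a projector maps the teacher's hidden states into the student's hidden space, and both distributions are then produced by the same prediction head, so any distance function can compare them even when vocabularies differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: 3 tokens, teacher hidden 8, student hidden 4, student vocab 12.
rng = np.random.default_rng(1)
teacher_h = rng.normal(size=(3, 8))
student_h = rng.normal(size=(3, 4))
student_head = rng.normal(size=(4, 12))

# The projector (learned in the real method, random here) maps teacher hidden
# states into the student's hidden space; both distributions then come from
# the SAME prediction head, so they live in one shared output space.
projector = rng.normal(size=(8, 4)) * 0.1
teacher_dist = softmax((teacher_h @ projector) @ student_head)
student_dist = softmax(student_h @ student_head)
# teacher_dist and student_dist are both (3, 12): any distance function
# (KL, JS, ...) can now compare them, regardless of the teacher's vocabulary.
```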
arXiv Detail & Related papers (2024-06-25T07:25:15Z) - DistiLLM: Towards Streamlined Distillation for Large Language Models [53.46759297929675]
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, where we unveil and leverage its theoretical properties, and (2) an adaptive off-policy approach designed to enhance the efficiency in utilizing student-generated outputs.
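The skew Kullback-Leibler loss mixes the reference distribution toward the target before measuring divergence; a minimal sketch (the mixing coefficient `alpha` here is illustrative, not the paper's reported setting):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def skew_kl(p, q, alpha=0.5):
    """KL(p || alpha * p + (1 - alpha) * q).

    Mixing p into the reference keeps the divergence finite and better
    behaved when q places near-zero mass where p does not.
    """
    m = alpha * p + (1 - alpha) * q
    return float(np.sum(p * (np.log(p) - np.log(m))))

p = softmax(np.array([2.0, 0.5, -1.0]))
q = softmax(np.array([-1.0, 0.5, 2.0]))
loss = skew_kl(p, q)
```

Because the mixture is itself a valid distribution, the loss stays non-negative and vanishes when student and teacher agree.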
arXiv Detail & Related papers (2024-02-06T11:10:35Z) - MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable for different model families with 120M to 13B parameters.
arXiv Detail & Related papers (2023-06-14T14:44:03Z) - Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models [21.177293243968744]
Knowledge Distillation (KD) into a smaller student model addresses the inefficiency of large pre-trained language models, allowing for deployment in resource-constrained environments.
We develop multilingual KD-NAS, the use of Neural Architecture Search (NAS) guided by KD to find the optimal student architecture for task distillation from a multilingual teacher.
Using our multi-layer hidden state distillation process, our KD-NAS student model achieves a 7x speedup on CPU inference (2x on GPU) compared to an XLM-RoBERTa Base teacher, while maintaining 90% performance.
arXiv Detail & Related papers (2023-03-16T20:39:44Z) - Pea-KD: Parameter-efficient and Accurate Knowledge Distillation on BERT [20.732095457775138]
Knowledge Distillation (KD) is one of the widely known methods for model compression.
Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher's Predictions (PTP).
arXiv Detail & Related papers (2020-09-30T17:52:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.