Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models
- URL: http://arxiv.org/abs/2508.06135v1
- Date: Fri, 08 Aug 2025 08:55:53 GMT
- Title: Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models
- Authors: Lingyuan Liu, Mengxiang Zhang
- Abstract summary: Knowledge Distillation (KD) is a technique for compressing large language models (LLMs) into compact, efficient student models. We propose Selective Reflection Distillation (SRD), a novel data curation framework. As a plug-and-play enhancement, SRD consistently improves distillation outcomes across diverse white-box KD approaches.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge Distillation (KD) is a fundamental technique for compressing large language models (LLMs) into compact, efficient student models. However, existing white-box KD methods mainly focus on balancing ground truth and student-generated responses while overlooking two critical factors: training data quality and student-model compatibility. To address these limitations, we propose Selective Reflection Distillation (SRD), a novel data curation framework that leverages reflections from student models to systematically refine training data. SRD dynamically evaluates and selects prompt-response pairs by comparing ground truth data with student model outputs, selectively curating high-quality, student-compatible training instances through automated ranking based on difficulty. Furthermore, after selecting the training data, a curriculum scheduling strategy is employed to incrementally introduce these curated subsets into the distillation process at fixed intervals. As a plug-and-play enhancement, SRD consistently improves distillation outcomes across diverse white-box KD approaches and model architectures, and significantly decreases computational cost during KD training. Experiments on a range of language model benchmarks demonstrate SRD's consistent improvements in distilled model performance, as well as a reduction in training runtime by up to 39%, under diverse KD methods and model families. Notably, SRD operates as a plug-and-play module, enhancing sample efficiency without modifying underlying KD algorithms. Our findings highlight that data quality and compatibility are pivotal to effective and efficient distillation of LLMs, and SRD provides a principled framework to achieve both. This work advances the understanding of data-centric factors in KD and offers practical insights for enhancing the capability and efficiency of compressed LLMs.
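To make the pipeline described in the abstract concrete, here is a minimal sketch of a difficulty-ranked selection step followed by staged introduction of the curated data. It is an illustration only: the function names, the use of the student's negative log-likelihood of the ground-truth response as the difficulty signal, and the keep ratio and stage count are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of SRD-style data curation (names and the scoring rule
# are illustrative assumptions, not taken from the paper).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Pair:
    prompt: str
    reference: str  # ground-truth response

def difficulty_score(student_nll: Callable[[str, str], float], pair: Pair) -> float:
    """Proxy for difficulty: the student's negative log-likelihood of the
    ground-truth response. Higher means the pair is harder for, and presumably
    less compatible with, the current student."""
    return student_nll(pair.prompt, pair.reference)

def curate(data: List[Pair], student_nll, keep_ratio: float = 0.5) -> List[Pair]:
    """Rank prompt-response pairs by difficulty and keep the most
    student-compatible fraction, easiest first."""
    ranked = sorted(data, key=lambda p: difficulty_score(student_nll, p))
    return ranked[: int(len(ranked) * keep_ratio)]

def curriculum_stages(curated: List[Pair], num_stages: int = 3):
    """Introduce the curated data incrementally: stage i trains on the first
    (i + 1) / num_stages fraction of the ranking, added at fixed intervals."""
    for stage in range(num_stages):
        cutoff = int(len(curated) * (stage + 1) / num_stages)
        yield curated[:cutoff]
```

Any white-box KD objective could then be run on the subset yielded at each stage, which is what would make such a curation step plug-and-play with existing distillation methods.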
Related papers
- Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework [0.0]
Knowledge Distillation (KD) compresses large language models (LLMs) by transferring the teacher model's capabilities to a smaller student model. Existing KD methods for LLMs often fail to prevent significant shifts in the student model's distribution during training. We propose a novel, plug-in curriculum learning framework inspired by the strength training principle of "progressive overload".
arXiv Detail & Related papers (2025-06-06T02:48:38Z)
- KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning [72.53466291156604]
We present KDRL, a unified post-training framework that jointly optimizes a reasoning model through teacher supervision (KD) and self-exploration (RL). We first formulate a unified objective that integrates GRPO and KD, and systematically explore how different KL approximations, KL coefficients, and reward-guided KD strategies affect the overall post-training dynamics and performance.
arXiv Detail & Related papers (2025-06-02T19:46:41Z)
- Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so that the student distills only from task-relevant representations. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
- Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models [4.737806982257592]
This study proposes a knowledge distillation algorithm based on large language models and feature alignment. The proposed model performs close to the state-of-the-art GPT-4 model on evaluation metrics such as perplexity, BLEU, ROUGE, and CER.
arXiv Detail & Related papers (2024-12-27T04:37:06Z)
- Active Data Curation Effectively Distills Large-Scale Multimodal Models [66.23057263509027]
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. In this work we explore an alternative yet simple approach: active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model, data, and compute configurations.
arXiv Detail & Related papers (2024-11-27T18:50:15Z)
- A Psychology-based Unified Dynamic Framework for Curriculum Learning [5.410910735259908]
This paper presents a Psychology-based Unified Dynamic Framework for Curriculum Learning (PUDF).
We quantify the difficulty of training data by applying Item Response Theory (IRT) to responses from Artificial Crowds (AC).
We propose a Dynamic Data Selection via Model Ability Estimation (DDS-MAE) strategy to schedule the appropriate amount of data during model training.
arXiv Detail & Related papers (2024-08-09T20:30:37Z)
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
- DistiLLM: Towards Streamlined Distillation for Large Language Models [53.46759297929675]
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties are unveiled and leveraged, and (2) an adaptive off-policy approach designed to improve the efficiency of utilizing student-generated outputs.
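As a point of reference, a skew Kullback-Leibler divergence mixes a small amount of the first distribution into the second before computing KL, which keeps the loss finite where the second distribution assigns near-zero mass. The sketch below is a generic illustration with an assumed mixing coefficient alpha, not DistiLLM's exact loss.

```python
import numpy as np

def skew_kl(p: np.ndarray, q: np.ndarray, alpha: float = 0.1, eps: float = 1e-12) -> float:
    """Skew KL divergence: KL(p || alpha * p + (1 - alpha) * q).
    p and q are probability vectors over the vocabulary (e.g., teacher and
    student next-token distributions); alpha and eps are illustrative choices."""
    m = alpha * p + (1.0 - alpha) * q
    return float(np.sum(p * (np.log(p + eps) - np.log(m + eps))))
```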
arXiv Detail & Related papers (2024-02-06T11:10:35Z)
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models. Most existing KD techniques rely on Kullback-Leibler (KL) divergence. We propose Robustness-Reinforced Knowledge Distillation (R2KD), which leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
- Data Upcycling Knowledge Distillation for Image Super-Resolution [25.753554952896096]
Knowledge distillation (KD) compresses deep neural networks by transferring task-related knowledge from pre-trained teacher models to compact student models.
We present Data Upcycling Knowledge Distillation (DUKD), which transfers the teacher model's knowledge to the student model through upcycled in-domain data derived from the training data.
arXiv Detail & Related papers (2023-09-25T14:13:26Z)
- Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)