Rethinking Selective Knowledge Distillation
- URL: http://arxiv.org/abs/2602.01395v1
- Date: Sun, 01 Feb 2026 18:58:27 GMT
- Title: Rethinking Selective Knowledge Distillation
- Authors: Almog Tavor, Itay Ebenspanger, Neil Cnaan, Mor Geva,
- Abstract summary: It remains unclear which importance signals, selection policies, and their interplay are most effective. We introduce student-entropy-guided position selection (SE-KD) and extend it across the class and sample axes. This approach yields complementary efficiency gains that make offline teacher caching feasible.
- Score: 21.167064592056196
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.
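As a rough illustration of the position axis described in the abstract, the sketch below applies a KL-based distillation loss only at the positions where the student's predictive entropy is highest. The function name, keep ratio, and temperature are hypothetical choices for illustration, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def se_kd_loss(student_logits, teacher_logits, keep_ratio=0.25, temperature=2.0):
    """Sketch of student-entropy-guided position selection.

    student_logits, teacher_logits: [batch, seq_len, vocab]
    Only the `keep_ratio` fraction of positions with the highest student
    entropy receive a KD (KL) loss; the rest are skipped, so teacher
    probabilities need only be computed or cached for selected positions.
    """
    with torch.no_grad():
        log_p_s = F.log_softmax(student_logits, dim=-1)
        entropy = -(log_p_s.exp() * log_p_s).sum(-1)      # [batch, seq_len]
        k = max(1, int(keep_ratio * entropy.size(1)))
        _, idx = entropy.topk(k, dim=1)                    # positions to distill

    # Gather the selected positions for both models.
    idx_exp = idx.unsqueeze(-1).expand(-1, -1, student_logits.size(-1))
    s_sel = torch.gather(student_logits, 1, idx_exp) / temperature
    t_sel = torch.gather(teacher_logits, 1, idx_exp) / temperature

    return F.kl_div(F.log_softmax(s_sel, dim=-1),
                    F.softmax(t_sel, dim=-1),
                    reduction="batchmean") * temperature ** 2
```

Because teacher outputs are only needed at the selected positions, they can in principle be precomputed and cached offline at a fraction of the storage of dense supervision, which is the efficiency angle the abstract highlights.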
Related papers
- Distillation of Large Language Models via Concrete Score Matching [28.320219993420434]
Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient inference. We propose a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set. Experiments show that CSD consistently surpasses recent KD objectives, achieves favorable fidelity-diversity trade-offs, and yields complementary gains when combined with on-policy techniques.
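A loose sketch of the ratio-matching idea behind score matching on discrete outputs: logit differences equal log-probability ratios and are unaffected by the softmax normalizer. The function below is purely illustrative and is not the paper's CSD objective.

```python
import torch

def ratio_matching_loss(student_logits, teacher_logits):
    """Hypothetical sketch: match pairwise log-probability ratios.

    For a categorical distribution, logit differences equal log p(i)/p(j),
    so matching them compares class ratios directly, without the smoothing
    introduced by the softmax normalizer. For a realistic vocabulary this
    pairwise form would have to be restricted to a small candidate set.
    """
    s_diff = student_logits.unsqueeze(-1) - student_logits.unsqueeze(-2)  # [..., V, V]
    t_diff = teacher_logits.unsqueeze(-1) - teacher_logits.unsqueeze(-2)
    return ((s_diff - t_diff) ** 2).mean()
```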
arXiv Detail & Related papers (2025-09-30T06:21:28Z) - PCoreSet: Effective Active Learning through Knowledge Distillation from Vision-Language Models [44.421994768941126]
We introduce ActiveKD, a framework that integrates active learning with knowledge distillation. A key aspect of ActiveKD is the structured prediction bias of large vision-language models (VLMs). We propose Probabilistic CoreSet (PCoreSet), a selection strategy that maximizes coverage in the probability space rather than the feature space.
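A minimal sketch of coverage-maximizing selection in probability space, assuming a farthest-point (k-center-style) greedy rule; the actual PCoreSet criterion, distance, and initialization may differ.

```python
import torch

def pcoreset_select(probs, budget):
    """Hypothetical sketch of coverage-maximizing selection in probability space.

    probs: [n_unlabeled, n_classes] predicted class probabilities (e.g. from a VLM).
    Greedily picks `budget` points so that every remaining point is close to
    some selected point in probability space (farthest-point traversal).
    """
    n = probs.size(0)
    selected = [torch.randint(n, (1,)).item()]            # arbitrary seed point
    dist = torch.cdist(probs, probs[selected]).min(dim=1).values
    for _ in range(budget - 1):
        nxt = int(dist.argmax())                          # farthest point so far
        selected.append(nxt)
        dist = torch.minimum(dist, torch.cdist(probs, probs[nxt:nxt + 1]).squeeze(1))
    return selected
```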
arXiv Detail & Related papers (2025-06-01T08:54:37Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced that filters and weights teacher representations so that only task-relevant representations are distilled. Experimental results on real-world affective computing datasets, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method outperforms state-of-the-art methods.
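A hypothetical sketch of student-guided weighting of stochastic teacher representations, assuming cosine similarity to the student is used as the relevance signal; the paper's filtering and weighting scheme may differ.

```python
import torch
import torch.nn.functional as F

def student_guided_feature_kd(student_feat, teacher_feats):
    """Hypothetical sketch: weight stochastic teacher representations by the student.

    student_feat:  [batch, d]       one student representation per sample
    teacher_feats: [batch, k, d]    k stochastic teacher representations per sample
    Teacher samples that agree more with the student get larger weights, so
    distillation focuses on representations the student finds relevant.
    """
    expanded = student_feat.unsqueeze(1).expand_as(teacher_feats)
    sim = F.cosine_similarity(teacher_feats, expanded, dim=-1)   # [batch, k]
    w = F.softmax(sim, dim=1).unsqueeze(-1)                      # [batch, k, 1]
    target = (w * teacher_feats).sum(dim=1)                      # weighted teacher target
    return F.mse_loss(student_feat, target.detach())
```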
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs). We re-formulate KD of LLMs into two stages, first optimizing an objective consisting of an implicit reward and a reverse KL divergence. We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
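A minimal sketch of the reverse-KL component; the implicit-reward term of DPKD is omitted, and the temperature is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def reverse_kl_kd(student_logits, teacher_logits, temperature=1.0):
    """Sketch of the reverse-KL term used in preference-style KD objectives.

    Reverse KL, KL(student || teacher), is mode-seeking: the student is pushed
    toward regions the teacher assigns high probability rather than spreading
    mass over the whole teacher distribution.
    """
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    log_p_t = F.log_softmax(teacher_logits / temperature, dim=-1)
    # KL(p_s || p_t) = sum_i p_s(i) * (log p_s(i) - log p_t(i))
    return (log_p_s.exp() * (log_p_s - log_p_t)).sum(-1).mean()
```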
arXiv Detail & Related papers (2024-06-28T09:23:40Z) - Densely Distilling Cumulative Knowledge for Continual Learning [14.343655566551213]
Continual learning, involving sequential training on diverse tasks, often faces catastrophic forgetting.
We propose Dense Knowledge Distillation (DKD) to distill the cumulative knowledge of all the previous tasks.
Our DKD outperforms recent state-of-the-art baselines across diverse benchmarks and scenarios.
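A hypothetical sketch of distilling cumulative knowledge from a frozen snapshot of the model taken after the previous tasks; the loss weighting and the exact outputs DKD distills are assumptions here.

```python
import copy
import torch
import torch.nn.functional as F

def cumulative_kd_step(model, prev_model, x, y, alpha=1.0, temperature=2.0):
    """Hypothetical sketch of distilling cumulative knowledge in continual learning.

    `prev_model` is a frozen copy taken after the previous tasks; its outputs
    stand in for the cumulative knowledge learned so far. The new-task loss is
    combined with a KD loss against that snapshot to limit forgetting.
    """
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)
    with torch.no_grad():
        old_logits = prev_model(x)
    kd = F.kl_div(F.log_softmax(logits / temperature, dim=-1),
                  F.softmax(old_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    return task_loss + alpha * kd

# Usage sketch: prev_model = copy.deepcopy(model).eval() right before starting a new task.
```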
arXiv Detail & Related papers (2024-05-16T05:37:06Z) - CrossKD: Cross-Head Knowledge Distillation for Object Detection [69.16346256926842]
Knowledge Distillation (KD) has been validated as an effective model compression technique for learning compact object detectors.
We present a prediction mimicking distillation scheme, called CrossKD, which delivers the intermediate features of the student's detection head to the teacher's detection head.
Our CrossKD boosts the average precision of GFL ResNet-50 with 1x training schedule from 40.2 to 43.7, outperforming all existing KD methods.
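A hypothetical sketch of cross-head prediction mimicking, assuming the detection head can be split into an early and a late stage; the module names and split point are illustrative, and the teacher is assumed to be frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyHead(nn.Module):
    """Stand-in for a detection head split into an early and a late stage."""
    def __init__(self, c=64, n_out=80):
        super().__init__()
        self.early = nn.Conv2d(c, c, 3, padding=1)
        self.late = nn.Conv2d(c, n_out, 3, padding=1)

def crosskd_loss(student_head, teacher_head, feat_s, feat_t):
    """Hypothetical sketch of cross-head prediction mimicking.

    feat_s, feat_t: neck features from the student and teacher backbones.
    Intermediate features from the student's head are passed through the late
    stage of the (frozen, requires_grad=False) teacher head, and the resulting
    predictions are trained to match the teacher's own predictions, separating
    KD gradients from the detection-loss gradients.
    """
    mid_s = student_head.early(feat_s)
    with torch.no_grad():
        pred_t = teacher_head.late(teacher_head.early(feat_t))
    cross_pred = teacher_head.late(mid_s)      # student features, teacher head
    return F.mse_loss(cross_pred, pred_t)
```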
arXiv Detail & Related papers (2023-06-20T08:19:51Z) - Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that this standard distillation paradigm incurs a serious bias issue: popular items are recommended even more heavily after distillation.
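A hypothetical sketch of one generic debiasing device, inverse-popularity weighting of the distillation loss; this is not necessarily the estimator used in the paper.

```python
import torch

def popularity_debiased_kd(student_scores, teacher_scores, item_popularity, eps=1e-6):
    """Hypothetical sketch: inverse-popularity weighting of a ranking KD loss.

    student_scores, teacher_scores: [batch, n_items] predicted relevance scores.
    item_popularity: [n_items] interaction counts. Popular items get smaller
    weights so the student is not pushed even harder toward them.
    """
    weights = 1.0 / (item_popularity.float() + eps)
    weights = weights / weights.sum() * weights.numel()     # normalize to mean 1
    per_item = (student_scores - teacher_scores.detach()) ** 2
    return (per_item * weights).mean()
```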
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - Localization Distillation for Object Detection [134.12664548771534]
Previous knowledge distillation (KD) methods for object detection mostly focus on feature imitation instead of mimicking the classification logits.
We present a novel localization distillation (LD) method which can efficiently transfer the localization knowledge from the teacher to the student.
We show that logit mimicking can outperform feature imitation, and that the absence of localization distillation is a critical reason why logit mimicking has underperformed for years.
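A minimal sketch of localization distillation on GFL-style discretized box-edge distributions; the bin count and temperature below are illustrative.

```python
import torch
import torch.nn.functional as F

def localization_distillation(student_reg, teacher_reg, temperature=10.0):
    """Sketch of localization distillation on discretized box-edge distributions.

    student_reg, teacher_reg: [n_boxes, 4, n_bins] logits for the four box edges,
    each represented as a distribution over n_bins (as in GFL-style heads).
    The student's edge distributions are matched to the teacher's with a
    temperature-scaled KL, transferring localization rather than class knowledge.
    """
    s = F.log_softmax(student_reg / temperature, dim=-1)
    t = F.softmax(teacher_reg / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```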
arXiv Detail & Related papers (2022-04-12T17:14:34Z) - Towards Reducing Labeling Cost in Deep Object Detection [61.010693873330446]
We propose a unified framework for active learning that considers both the uncertainty and the robustness of the detector.
Our method is able to pseudo-label very confident predictions, suppressing potential distribution drift.
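A hypothetical sketch of combining uncertainty-based querying with confident pseudo-labeling; the entropy criterion and thresholds are illustrative, and the robustness term of the paper is not shown.

```python
import torch
import torch.nn.functional as F

def split_unlabeled_pool(logits, query_budget, confidence_thresh=0.95):
    """Hypothetical sketch: uncertainty-based querying plus confident pseudo-labels.

    logits: [n_unlabeled, n_classes] scores on the unlabeled pool. The most
    uncertain samples (highest entropy) are queried for human labels, while
    very confident predictions are pseudo-labeled, which helps suppress
    distribution drift between labeled and unlabeled data.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    query_idx = entropy.topk(query_budget).indices               # ask an annotator
    conf, pseudo_labels = probs.max(dim=-1)
    pseudo_idx = (conf >= confidence_thresh).nonzero(as_tuple=True)[0]
    return query_idx, pseudo_idx, pseudo_labels[pseudo_idx]
```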
arXiv Detail & Related papers (2021-06-22T16:53:09Z) - Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
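A minimal sketch of the mean-teacher update underlying such frameworks: the teacher is an exponential moving average of the student, so unlabeled images can supply consistency targets under perturbation. The mask guidance and perturbation-sensitive mining of the paper are not shown.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """Sketch of the mean-teacher update: the teacher is an EMA of the student.

    Assumes teacher and student share the same architecture so their
    parameters align one-to-one.
    """
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```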
arXiv Detail & Related papers (2020-07-21T13:27:09Z) - Self-Knowledge Distillation with Progressive Refinement of Targets [1.1470070927586016]
We propose a simple yet effective regularization method named progressive self-knowledge distillation (PS-KD).
PS-KD progressively distills a model's own knowledge to soften hard targets during training.
We show that PS-KD provides an effect of hard example mining by rescaling gradients according to difficulty in classifying examples.
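A sketch of the progressively softened targets, assuming the commonly described linear schedule for the mixing weight; treat the final weight and schedule as illustrative.

```python
import torch
import torch.nn.functional as F

def ps_kd_targets(hard_labels, past_probs, epoch, total_epochs, alpha_final=0.8):
    """Sketch of progressively softened targets, in the spirit of PS-KD.

    hard_labels: [batch] class indices; past_probs: [batch, n_classes]
    predictions of the model's own earlier snapshot. The mixing weight grows
    over training, so targets shift from hard labels toward the model's own
    past predictions.
    """
    alpha_t = alpha_final * (epoch + 1) / total_epochs
    one_hot = F.one_hot(hard_labels, past_probs.size(-1)).float()
    return (1.0 - alpha_t) * one_hot + alpha_t * past_probs

# Training then uses soft-target cross-entropy, e.g.
# loss = -(ps_kd_targets(y, p_prev, epoch, T) * F.log_softmax(logits, -1)).sum(-1).mean()
```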
arXiv Detail & Related papers (2020-06-22T04:06:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.