Revisiting Knowledge Distillation: The Hidden Role of Dataset Size
- URL: http://arxiv.org/abs/2510.15516v1
- Date: Fri, 17 Oct 2025 10:40:45 GMT
- Title: Revisiting Knowledge Distillation: The Hidden Role of Dataset Size
- Authors: Giulia Lanzillotta, Felix Sarnthein, Gil Kur, Thomas Hofmann, Bobby He
- Abstract summary: Knowledge distillation (KD) describes the training of a student model from a teacher model and is a widely adopted technique in deep learning. Previous studies focus on two central aspects of distillation: model size, and generalisation. In this work we study distillation in a third dimension: dataset size. We present a suite of experiments across a wide range of datasets, tasks and neural architectures, demonstrating that the effect of distillation is not only preserved but amplified in low-data regimes.
- Score: 37.68403967604424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The concept of knowledge distillation (KD) describes the training of a student model from a teacher model and is a widely adopted technique in deep learning. However, it is still not clear how and why distillation works. Previous studies focus on two central aspects of distillation: model size, and generalisation. In this work we study distillation in a third dimension: dataset size. We present a suite of experiments across a wide range of datasets, tasks and neural architectures, demonstrating that the effect of distillation is not only preserved but amplified in low-data regimes. We call this newly discovered property the data efficiency of distillation. Equipped with this new perspective, we test the predictive power of existing theories of KD as we vary the dataset size. Our results disprove the hypothesis that distillation can be understood as label smoothing, and provide further evidence in support of the dark knowledge hypothesis. Finally, we analyse the impact of modelling factors such as the objective, scale and relative number of samples on the observed phenomenon. Ultimately, this work reveals that the dataset size may be a fundamental but overlooked variable in the mechanisms underpinning distillation.
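For reference, the distillation setup discussed in the abstract trains the student on a mixture of ground-truth labels and the teacher's temperature-softened predictions (the soft targets carrying the "dark knowledge"). Below is a minimal PyTorch sketch of that standard objective (in the style of Hinton et al., 2015); the function name, temperature, and mixing weight are illustrative choices, not values taken from the paper.

```python
# Minimal sketch of a standard knowledge-distillation objective.
# `temperature` and `alpha` are illustrative hyperparameters, not the paper's.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-target KL term."""
    # Hard-label loss: ordinary cross-entropy on ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label loss: KL divergence between temperature-scaled
    # teacher and student distributions ("dark knowledge").
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    kd = kd * temperature ** 2  # standard gradient-scale correction
    return alpha * ce + (1.0 - alpha) * kd
```

In the low-data regimes the paper studies, only the number of training samples passed through such a loss changes; the objective itself stays fixed.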
Related papers
- Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective [52.25797439810419]
Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. We derive a CMI-inspired anti-distillation objective to optimize a transformation of the teacher outputs, which effectively removes distillation-relevant information while preserving output utility.
arXiv Detail & Related papers (2026-02-03T11:16:59Z) - Distilling Diversity and Control in Diffusion Models [27.352868008401614]
Distilled diffusion models suffer from a critical limitation: reduced sample diversity compared to their base counterparts. We show that despite this diversity loss, distilled models retain the fundamental concept representations of base models. We introduce diversity distillation - a hybrid inference approach that strategically employs the base model for only the first critical timestep before transitioning to the efficient distilled model.
arXiv Detail & Related papers (2025-03-13T17:59:56Z) - Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods. Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions. Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z) - Exploring the potential of prototype-based soft-labels data distillation for imbalanced data classification [0.0]
The main goal is to further improve the classification accuracy of prototype-based soft-labels distillation.
Experimental studies demonstrate the method's capability to distill the data, as well as its potential to serve as an augmentation technique.
arXiv Detail & Related papers (2024-03-25T19:15:19Z) - Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task.
By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset.
We identify the most influential samples based on their causal effects on the distillation.
arXiv Detail & Related papers (2023-05-28T06:53:41Z) - A Comprehensive Study on Dataset Distillation: Performance, Privacy, Robustness and Fairness [8.432686179800543]
We conduct extensive experiments to evaluate current state-of-the-art dataset distillation methods.
We successfully use membership inference attacks to show that privacy risks still remain.
This work offers a large-scale benchmarking framework for dataset distillation evaluation.
arXiv Detail & Related papers (2023-05-05T08:19:27Z) - HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z) - Unified and Effective Ensemble Knowledge Distillation [92.67156911466397]
Ensemble knowledge distillation can extract knowledge from multiple teacher models and encode it into a single student model.
Many existing methods train and distill the student model on labeled data only.
We propose a unified and effective ensemble knowledge distillation method that distills a single student model from an ensemble of teacher models on both labeled and unlabeled data.
arXiv Detail & Related papers (2022-04-01T16:15:39Z) - Visualizing the embedding space to explain the effect of knowledge distillation [5.678337324555035]
Recent research has found that knowledge distillation can be effective in reducing the size of a network.
Despite these advances, it is still relatively unclear why this method works, that is, what the resulting student model does 'better'.
arXiv Detail & Related papers (2021-10-09T07:04:26Z) - Why distillation helps: a statistical perspective [69.90148901064747]
Knowledge distillation is a technique for improving the performance of a simple "student" model.
While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help?
We show how distillation complements existing negative mining techniques for extreme multiclass retrieval.
arXiv Detail & Related papers (2020-05-21T01:49:51Z)