Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces
- URL: http://arxiv.org/abs/2601.05913v1
- Date: Fri, 09 Jan 2026 16:28:55 GMT
- Title: Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces
- Authors: Pattarawat Chormai, Ali Hashemi, Klaus-Robert Müller, Grégoire Montavon
- Abstract summary: 'SubDistill' is a new distillation algorithm with improved numerical properties that only distills the relevant components of the teacher model at each layer. Our benchmark evaluations are complemented by Explainable AI analyses showing that our distilled student models more closely match the decision structure of the original teacher model.
- Score: 17.627125013326175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation involves transferring the predictive capabilities of large, high-performing AI models (teachers) to smaller models (students) that can operate in environments with limited computing power. In this paper, we address the scenario in which only a few classes and their associated intermediate concepts are relevant to distill. This scenario is common in practice, yet few existing distillation methods explicitly focus on the relevant subtask. To address this gap, we introduce 'SubDistill', a new distillation algorithm with improved numerical properties that only distills the relevant components of the teacher model at each layer. Experiments on CIFAR-100 and ImageNet with Convolutional and Transformer models demonstrate that SubDistill outperforms existing layer-wise distillation techniques on a representative set of subtasks. Our benchmark evaluations are complemented by Explainable AI analyses showing that our distilled student models more closely match the decision structure of the original teacher model.
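To make the core idea concrete, below is a minimal, hedged sketch of subspace-restricted layer-wise feature distillation: identify a low-dimensional subspace of a teacher layer's activations (here simply via PCA over activations gathered on the relevant subtask) and match student and teacher features only within that subspace. The function names and the PCA-based relevance criterion are illustrative assumptions, not the paper's actual SubDistill procedure.

```python
# Hedged sketch: layer-wise feature distillation restricted to a "relevant"
# subspace of one teacher layer. The subspace criterion (plain PCA over teacher
# activations from the relevant subtask) and all names are assumptions; the
# actual SubDistill algorithm may identify relevant components differently.
import torch
import torch.nn.functional as F


def relevant_subspace(teacher_feats: torch.Tensor, k: int) -> torch.Tensor:
    """Return a (d, k) orthonormal basis for the top-k principal directions
    of teacher activations (n, d) collected on the relevant subtask."""
    centered = teacher_feats - teacher_feats.mean(dim=0, keepdim=True)
    # Right singular vectors of the centered activations = principal axes.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:k].T  # shape (d, k)


def subspace_distill_loss(student_feats: torch.Tensor,
                          teacher_feats: torch.Tensor,
                          basis: torch.Tensor) -> torch.Tensor:
    """L2 feature-matching loss applied only inside the identified subspace.
    Assumes the student layer has (or has been linearly mapped to) the
    teacher's feature width."""
    return F.mse_loss(student_feats @ basis, teacher_feats @ basis)


# Toy usage: 512-d teacher layer, 10-d relevant subspace, batch of 256.
t_feats = torch.randn(256, 512)
s_feats = torch.randn(256, 512, requires_grad=True)
basis = relevant_subspace(t_feats, k=10)
loss = subspace_distill_loss(s_feats, t_feats, basis)
loss.backward()
```

In practice the student layer width usually differs from the teacher's, so a learned linear mapping would precede the subspace loss, and the paper's criterion for "relevant components" is presumably tied to the chosen classes and concepts rather than plain PCA.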
Related papers
- On-Policy Context Distillation for Language Models [92.82835176360864]
We propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation and system prompt distillation.
arXiv Detail & Related papers (2026-02-12T18:58:28Z) - Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation [50.784080714897776]
Knowledge distillation (KD) is a core component in the training and deployment of modern generative models. We show that KD induces a trade-off between precision and recall in the student model. Our analysis provides a simple and general explanation for the effectiveness of KD in generative modeling.
arXiv Detail & Related papers (2025-05-19T13:39:47Z) - Honey, I Shrunk the Language Model: Impact of Knowledge Distillation Methods on Performance and Explainability [3.224880576815583]
High computational and storage demands of Large Language Models limit their deployment in resource-constrained environments. Previous research has introduced several distillation methods for both generating training data and for training the student model. Despite their relevance, the effects of state-of-the-art distillation methods on model performance and explainability have not been thoroughly investigated.
arXiv Detail & Related papers (2025-04-22T17:32:48Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so that the student distills only from task-relevant representations. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - AMD: Automatic Multi-step Distillation of Large-scale Vision Models [39.70559487432038]
We present a novel approach named Automatic Multi-step Distillation (AMD) for large-scale vision model compression.
An efficient and effective optimization framework is introduced to automatically identify the optimal teacher-assistant that leads to the maximal student performance.
arXiv Detail & Related papers (2024-07-05T01:35:42Z) - Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion [29.297959023968165]
This paper proposes a progressive distillation method based on masked generation features for the KGC task.
Specifically, we perform pre-distillation on the PLM to obtain high-quality teacher models and compress the PLM network to obtain multi-grade student models.
The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-01-19T07:34:36Z) - Education distillation: getting student models to learn in schools [11.017346789801238]
This paper introduces a new knowledge distillation method, called education distillation (ED). ED mimics the educational stages of primary school, middle school, and university and designs teaching reference blocks. Experimental results on the CIFAR-100, Tiny ImageNet, Caltech and Food-101 datasets show that the teaching reference blocks can effectively avoid the problem of forgetting.
arXiv Detail & Related papers (2023-11-23T05:20:18Z) - EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Self-Feature Regularization: Self-Feature Distillation Without Teacher Models [0.0]
Self-Feature Regularization (SFR) is proposed, which uses features in the deep layers to supervise feature learning in the shallow layers.
We first use a generalization-L2 loss to match local features and a many-to-one approach to distill more intensively in the channel dimension.
arXiv Detail & Related papers (2021-03-12T15:29:00Z) - Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to each teacher model for the whole distillation, and most existing methods allocate an equal weight to every teacher.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z) - Pre-trained Summarization Distillation [121.14806854092672]
Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning; a hedged sketch of this idea follows the list below.
arXiv Detail & Related papers (2020-10-24T23:15:43Z)
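For the 'shrink and fine-tune' (SFT) approach mentioned in the last entry above, here is one hedged interpretation: build a shallower student by copying a subset of the teacher's layers and then fine-tune it with the ordinary task loss. The keep-every-other-layer rule and the helper name `shrink` are illustrative assumptions, not necessarily the selection scheme used in the cited paper.

```python
# Hedged sketch of 'shrink and fine-tune' (SFT): copy a subset of teacher
# layers (weights included) into a shallower student, then fine-tune the
# student on the task with no explicit distillation objective. The
# keep-every-other-layer rule below is an illustrative assumption.
import copy
import torch.nn as nn


def shrink(teacher_layers: nn.ModuleList, keep_every: int = 2) -> nn.ModuleList:
    """Copy every `keep_every`-th teacher layer into a new, shallower stack."""
    kept = [copy.deepcopy(layer)
            for i, layer in enumerate(teacher_layers)
            if i % keep_every == 0]
    return nn.ModuleList(kept)


# Toy usage: a 12-block teacher encoder shrunk to a 6-block student.
teacher_blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
     for _ in range(12)])
student_blocks = shrink(teacher_blocks, keep_every=2)
assert len(student_blocks) == 6
# The student stack would then be wrapped in the full model and fine-tuned.
```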