Towards a theory of model distillation
- URL: http://arxiv.org/abs/2403.09053v2
- Date: Sat, 4 May 2024 19:52:03 GMT
- Title: Towards a theory of model distillation
- Authors: Enric Boix-Adsera
- Abstract summary: Distillation is the task of replacing a complicated machine learning model with a simpler model that approximates the original.
We show how to efficiently distill neural networks into succinct, explicit decision tree representations.
We prove that distillation can be much cheaper than learning from scratch, and make progress on characterizing its complexity.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distillation is the task of replacing a complicated machine learning model with a simpler model that approximates the original [BCNM06,HVD15]. Despite many practical applications, basic questions about the extent to which models can be distilled, and the runtime and amount of data needed to distill, remain largely open. To study these questions, we initiate a general theory of distillation, defining PAC-distillation in an analogous way to PAC-learning [Val84]. As applications of this theory: (1) we propose new algorithms to extract the knowledge stored in the trained weights of neural networks -- we show how to efficiently distill neural networks into succinct, explicit decision tree representations when possible by using the "linear representation hypothesis"; and (2) we prove that distillation can be much cheaper than learning from scratch, and make progress on characterizing its complexity.
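By analogy with PAC-learning, the PAC-distillation criterion can be sketched roughly as follows. This is an illustrative formalization only, not necessarily the paper's exact definition; the symbols (trained source model $f$, distilled model $g$, target class $\mathcal{G}$, input distribution $\mathcal{D}$, accuracy $\epsilon$, confidence $\delta$) are generic placeholders:
\[
  \Pr_{S \sim \mathcal{D}^n}\Bigl[\, \Pr_{x \sim \mathcal{D}}\bigl[ g_S(x) \neq f(x) \bigr] \le \epsilon \,\Bigr] \;\ge\; 1 - \delta,
\]
where $g_S \in \mathcal{G}$ is the model output by the distillation algorithm when given the trained model $f$ and $n$ i.i.d. unlabeled samples $S$ from $\mathcal{D}$. The questions studied are how small the sample size $n$ and the runtime can be made, and in particular whether they can be far smaller than what is required to learn $f$'s behavior from scratch.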
Related papers
- Exploring the potential of prototype-based soft-labels data distillation for imbalanced data classification [0.0]
The main goal is to further push the performance of prototype-based soft-labels distillation in terms of classification accuracy.
Experimental studies trace the capability of the method to distill the data, as well as its potential to act as an augmentation method.
arXiv Detail & Related papers (2024-03-25T19:15:19Z)
- Online Distillation for Pseudo-Relevance Feedback [16.523925354318983]
We investigate whether a model for a specific query can be effectively distilled from neural re-ranking results.
We find that a lexical model distilled online can reasonably replicate the re-ranking of a neural model.
More importantly, these models can be used as queries that execute efficiently on indexes.
arXiv Detail & Related papers (2023-06-16T07:26:33Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes the limitations of existing distillation methods with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of "student" models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism; the standard soft-target objective that such analyses build on is sketched after this list.
arXiv Detail & Related papers (2023-05-27T21:25:55Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- DETRDistill: A Universal Knowledge Distillation Framework for DETR-families [11.9748352746424]
Transformer-based detectors (DETRs) have attracted great attention due to their sparse training paradigm and the removal of post-processing operations.
Knowledge distillation (KD) can be employed to compress such huge models by constructing a universal teacher-student learning framework.
arXiv Detail & Related papers (2022-11-17T13:35:11Z)
- Referee: Reference-Free Sentence Summarization with Sharper Controllability through Symbolic Knowledge Distillation [72.70058049274664]
We present Referee, a novel framework for sentence summarization that can be trained reference-free (i.e., requiring no gold summaries for supervision).
Our work is the first to demonstrate that reference-free, controlled sentence summarization is feasible via the conceptual framework of Symbolic Knowledge Distillation.
arXiv Detail & Related papers (2022-10-25T07:07:54Z)
- Self-Knowledge Distillation via Dropout [0.7883397954991659]
We propose a simple and effective self-knowledge distillation method using dropout (SD-Dropout).
Our method does not require any additional trainable modules, does not rely on data, and requires only simple operations.
arXiv Detail & Related papers (2022-08-11T05:08:55Z)
- ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self On-the-fly Distillation for Dense Passage Retrieval [54.54667085792404]
We propose a novel distillation method that significantly advances cross-architecture distillation for dual-encoders.
Our method 1) introduces a self on-the-fly distillation method that can effectively distill a late-interaction model (i.e., ColBERT) into a vanilla dual-encoder, and 2) incorporates a cascade distillation process to further improve the performance with a cross-encoder teacher.
arXiv Detail & Related papers (2022-05-18T18:05:13Z)
- Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning [93.18238573921629]
We study how an ensemble of deep learning models can improve test accuracy, and how the superior performance of the ensemble can be distilled into a single model.
We show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory.
We prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.
arXiv Detail & Related papers (2020-12-17T18:34:45Z)
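Several of the entries above (e.g., "Knowledge Distillation Performs Partial Variance Reduction" and the ensemble/self-distillation analysis) build on the standard soft-target objective of [HVD15], referenced in the abstract. The following is a minimal illustrative sketch of that objective, assuming PyTorch; the function name kd_loss and the hyperparameter values are placeholders, not taken from any of the papers listed here.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    # Cross-entropy on the hard labels, as in ordinary supervised training.
    hard = F.cross_entropy(student_logits, labels)
    # KL divergence between the temperature-softened teacher and student
    # distributions, scaled by T^2 so gradient magnitudes stay comparable
    # across temperatures (per Hinton et al., 2015).
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * hard + (1.0 - alpha) * soft

# Usage with random tensors standing in for a batch of 8 examples over 10 classes:
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = kd_loss(student_logits, teacher_logits, labels)
loss.backward()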