Distillation Scaling Laws
- URL: http://arxiv.org/abs/2502.08606v2
- Date: Fri, 25 Jul 2025 16:55:43 GMT
- Title: Distillation Scaling Laws
- Authors: Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb
- Abstract summary: We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.
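To make the idea of compute-optimal allocation concrete, the sketch below is a minimal illustration, not the paper's fitted result. It assumes a Chinchilla-style supervised loss for the teacher, a placeholder distillation law in which the student's loss decays toward the teacher's loss, and the standard ~6ND (training) / ~2ND (forward-pass) FLOP approximations; every coefficient value and the `best_allocation` grid search are illustrative assumptions introduced here, not fitted parameters from the paper.

```python
"""Minimal sketch of compute-optimal teacher/student allocation.

NOT the paper's fitted distillation scaling law: the functional forms and
every coefficient below are illustrative placeholders. The only standard
pieces are the Chinchilla-style parametric form E + A/N^a + B/D^b and the
~6*N*D (training) / ~2*N*D (forward-pass) FLOP approximations.
"""
import numpy as np

# Assumed supervised scaling law for teacher pretraining (placeholder values).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def supervised_loss(n_params, n_tokens):
    """Chinchilla-style loss in model parameters and pretraining tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Assumed (placeholder) distillation law: student loss decays toward the
# teacher's loss as student size and distillation tokens grow. The paper
# fits a richer form that also captures the teacher-student capacity gap.
AD, BD, ALPHA_D, BETA_D = 300.0, 350.0, 0.33, 0.30

def distilled_student_loss(n_student, d_distill, teacher_loss):
    return teacher_loss + AD / n_student**ALPHA_D + BD / d_distill**BETA_D

def best_allocation(total_flops, n_student, teacher_sizes, teacher_fracs):
    """Grid-search the split of a fixed FLOP budget between teacher
    pretraining and student distillation that minimizes student loss."""
    best = None
    for n_teacher in teacher_sizes:
        for frac in teacher_fracs:            # fraction of budget spent on the teacher
            c_teacher = frac * total_flops
            d_teacher = c_teacher / (6.0 * n_teacher)        # ~6ND training FLOPs
            l_teacher = supervised_loss(n_teacher, d_teacher)
            # Remaining budget pays for student training (~6*N_S per token)
            # plus the teacher's forward passes (~2*N_T per token).
            c_student = total_flops - c_teacher
            d_distill = c_student / (6.0 * n_student + 2.0 * n_teacher)
            l_student = distilled_student_loss(n_student, d_distill, l_teacher)
            if best is None or l_student < best[0]:
                best = (l_student, n_teacher, frac, d_distill)
    return best

if __name__ == "__main__":
    loss, n_teacher, frac, d_distill = best_allocation(
        total_flops=1e21,                                # illustrative budget
        n_student=400e6,                                 # 400M-parameter student
        teacher_sizes=np.logspace(8.5, 10.5, 20),        # ~300M to ~30B parameters
        teacher_fracs=np.linspace(0.05, 0.95, 19),
    )
    print(f"student loss {loss:.3f} | teacher {n_teacher/1e9:.1f}B params | "
          f"{frac:.0%} of compute on teacher training | "
          f"{d_distill/1e9:.0f}B distillation tokens")
```

A grid search is used here purely to keep the sketch self-contained; the paper's compute-optimal recipes come from its fitted scaling law rather than from this placeholder form.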
Related papers
- Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced to filter and weight teacher representations so that only task-relevant representations are distilled. Experimental results on real-world affective computing datasets, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - UNDO: Understanding Distillation as Optimization [9.100811514331498]
We introduce the UNDO (UNderstanding Distillation as Optimization) framework. Each iteration directly targets the student's learning deficiencies, motivating the teacher to provide tailored and enhanced rationales. Empirical evaluations on various challenging mathematical and commonsense reasoning tasks demonstrate that our iterative distillation method, UNDO, significantly outperforms standard one-step distillation methods.
arXiv Detail & Related papers (2025-04-03T12:18:51Z) - Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation [84.38105530043741]
We propose Warmup-Distill, which aligns the student's distribution with the teacher's in advance of distillation.
Experiments on seven benchmarks demonstrate that Warmup-Distill provides a warmed-up student that is more suitable for distillation.
arXiv Detail & Related papers (2025-02-17T12:58:12Z) - Towards Training One-Step Diffusion Models Without Distillation [72.80423908458772]
We show that one-step generative models can be trained directly without this distillation process. We propose a family of distillation methods that achieve competitive results without relying on score estimation.
arXiv Detail & Related papers (2025-02-11T23:02:14Z) - Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation [52.53446712834569]
Learning Good Teacher Matters (LGTM) is an efficient training technique for incorporating distillation influence into the teacher's learning process.
Our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.
arXiv Detail & Related papers (2023-05-16T17:50:09Z) - HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained
Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z) - Supervision Complexity and its Role in Knowledge Distillation [65.07910515406209]
We study the generalization behavior of a distilled student.
The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions.
We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
arXiv Detail & Related papers (2023-01-28T16:34:47Z) - PROD: Progressive Distillation for Dense Retrieval [65.83300173604384]
It is common for a stronger teacher model to produce a worse student via distillation, owing to the non-negligible capacity gap between teacher and student.
We propose PROD, a PROgressive Distillation method, for dense retrieval.
arXiv Detail & Related papers (2022-09-27T12:40:29Z) - Controlling the Quality of Distillation in Response-Based Network
Compression [0.0]
The performance of a compressed network is governed by the quality of distillation.
For a given teacher-student pair, the quality of distillation can be improved by finding the sweet spot between batch size and number of epochs while training the teacher.
arXiv Detail & Related papers (2021-12-19T02:53:51Z) - Does Knowledge Distillation Really Work? [106.38447017262183]
We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
We identify difficulties in optimization as a key reason for why the student is unable to match the teacher.
arXiv Detail & Related papers (2021-06-10T17:44:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.