Distillation Scaling Laws
- URL: http://arxiv.org/abs/2502.08606v1
- Date: Wed, 12 Feb 2025 17:52:47 GMT
- Title: Distillation Scaling Laws
- Authors: Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb
- Abstract summary: We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher.
Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance.
- Score: 9.828322497230053
- License:
- Abstract: We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated with using distillation at scale; compute allocation for both the teacher and student models can now be done to maximize student performance. We provide compute optimal distillation recipes for when 1) a teacher exists, or 2) a teacher needs training. If many students are to be distilled, or a teacher already exists, distillation outperforms supervised pretraining until a compute level which grows predictably with student size. If one student is to be distilled and a teacher also needs training, supervised learning should be done instead. Additionally, we provide insights across our large scale study of distillation, which increase our understanding of distillation and inform experimental design.
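As a rough illustration of how a law of this kind can be used, the Python sketch below grid-searches the split of a fixed training-FLOPs budget between teacher pretraining and student distillation. The parametric forms, coefficients, and helper names (`supervised_loss`, `student_loss`, `best_allocation`) are hypothetical placeholders rather than the law fitted in the paper; only the C ≈ 6·N·D FLOPs approximation and the Chinchilla-style supervised form are standard conventions.

```python
import itertools

# Toy illustration only: the parametric forms and coefficients below are
# placeholders, NOT the law fitted in the paper. The point is to show how a
# distillation scaling law could be queried to split a fixed compute budget
# between teacher pretraining and student distillation.

FLOPS_PER_PARAM_TOKEN = 6.0  # standard C ~ 6*N*D training-FLOPs approximation


def supervised_loss(n_params: float, n_tokens: float) -> float:
    """Chinchilla-style supervised pretraining loss, used here as a teacher proxy."""
    return 1.69 + 406.4 / n_params**0.34 + 410.7 / n_tokens**0.28


def student_loss(n_student: float, d_distill: float, teacher_loss: float) -> float:
    """Hypothetical distillation law: irreducible term + capacity term + data term
    + a penalty that grows with a weaker teacher. Coefficients are made up."""
    return 1.69 + 350.0 / n_student**0.34 + 400.0 / d_distill**0.28 \
        + 0.05 * (teacher_loss - 1.69)


def best_allocation(total_flops: float, n_student: float, teacher_sizes):
    """Grid-search the teacher size and the fraction of compute spent on the
    teacher, returning the allocation with the lowest predicted student loss."""
    best = None
    for n_teacher, frac in itertools.product(teacher_sizes, (0.2, 0.4, 0.6, 0.8)):
        teacher_tokens = frac * total_flops / (FLOPS_PER_PARAM_TOKEN * n_teacher)
        distill_tokens = (1 - frac) * total_flops / (FLOPS_PER_PARAM_TOKEN * n_student)
        loss = student_loss(n_student, distill_tokens,
                            supervised_loss(n_teacher, teacher_tokens))
        if best is None or loss < best[0]:
            best = (loss, n_teacher, frac)
    return best


if __name__ == "__main__":
    # e.g. a 1B-parameter student, a 1e21 FLOP budget, and a few candidate teachers
    print(best_allocation(1e21, n_student=1e9, teacher_sizes=(3e9, 7e9, 13e9)))
```

In practice one would substitute the paper's fitted coefficients and a finer search grid; the sketch only conveys the shape of the decision such a scaling law enables.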
Related papers
- Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation [84.38105530043741]
We propose Warmup-Distill, which aligns the student's distribution with the teacher's in advance of distillation.
Experiments on seven benchmarks demonstrate that Warmup-Distill provides a warmed-up student that is better suited to distillation.
arXiv Detail & Related papers (2025-02-17T12:58:12Z)
- Towards Training One-Step Diffusion Models Without Distillation [72.80423908458772]
We show that one-step generative models can be trained directly, without such a distillation process.
We propose a family of distillation methods that achieve competitive results without relying on score estimation.
arXiv Detail & Related papers (2025-02-11T23:02:14Z)
- Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation [52.53446712834569]
Learning Good Teacher Matters (LGTM) is an efficient training technique for incorporating distillation influence into the teacher's learning process.
Our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.
arXiv Detail & Related papers (2023-05-16T17:50:09Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- Supervision Complexity and its Role in Knowledge Distillation [65.07910515406209]
We study the generalization behavior of a distilled student.
Our framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher's predictions, and the complexity of the teacher's predictions.
We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
arXiv Detail & Related papers (2023-01-28T16:34:47Z)
- PROD: Progressive Distillation for Dense Retrieval [65.83300173604384]
It is common for a stronger teacher model to yield a worse student through distillation, owing to the non-negligible gap between teacher and student.
We propose PROD, a PROgressive Distillation method, for dense retrieval.
arXiv Detail & Related papers (2022-09-27T12:40:29Z)
- Controlling the Quality of Distillation in Response-Based Network Compression [0.0]
The performance of a compressed network is governed by the quality of distillation.
For a given teacher-student pair, the quality of distillation can be improved by finding the sweet spot between the batch size and the number of epochs used to train the teacher.
arXiv Detail & Related papers (2021-12-19T02:53:51Z)
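Several entries above, and "response-based" compression in particular, refer to the standard soft-target distillation objective. For reference, here is a minimal PyTorch sketch of that generic loss; the function name, temperature, and mixing weight `alpha` are illustrative choices and are not taken from any of the listed papers.

```python
import torch
import torch.nn.functional as F


def response_distillation_loss(student_logits: torch.Tensor,
                               teacher_logits: torch.Tensor,
                               labels: torch.Tensor,
                               temperature: float = 4.0,
                               alpha: float = 0.5) -> torch.Tensor:
    """Generic soft-target (response-based) distillation loss: a temperature-
    softened KL term against the teacher plus cross-entropy on the hard labels."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 so its gradients stay comparable to the CE term.
    kd_term = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

A training loop would add this loss on each batch, with the teacher forward pass wrapped in `torch.no_grad()` so only the student is updated.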
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.