Towards the Law of Capacity Gap in Distilling Language Models
- URL: http://arxiv.org/abs/2311.07052v4
- Date: Wed, 30 Jul 2025 16:00:53 GMT
- Title: Towards the Law of Capacity Gap in Distilling Language Models
- Authors: Chen Zhang, Qiuchi Li, Dawei Song, Zheyu Ye, Yan Gao, Yan Hu
- Abstract summary: Language model (LM) distillation aims at distilling the knowledge in a large teacher LM to a small student one. As a critical issue facing LM distillation, a superior student often arises from a teacher of a relatively small scale instead of a larger one. This paper provides the \textit{law of capacity gap} inducted from a preliminary study on distilling a broad range of small-scale (<3B) LMs.
- Score: 17.94199083434851
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language model (LM) distillation aims at distilling the knowledge in a large teacher LM to a small student one. As a critical issue facing LM distillation, a superior student often arises from a teacher of a relatively small scale instead of a larger one, especially in the presence of substantial capacity gap between the teacher and student. This issue, often referred to as the \textit{curse of capacity gap}, suggests that there is likely an optimal teacher yielding the best-performing student along the scaling course of the teacher. Consequently, distillation trials on teachers of a wide range of scales are called for to determine the optimal teacher, which becomes computationally intensive in the context of large LMs (LLMs). This paper addresses this critical bottleneck by providing the \textit{law of capacity gap} inducted from a preliminary study on distilling a broad range of small-scale (<3B) LMs, where the optimal teacher consistently scales linearly with the student scale across different model and data scales. By extending the law to LLM distillation on a larger scale (7B), we succeed in obtaining versatile LLMs that outperform a wide array of competitors.
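Stated as a worked equation (the proportionality constant below is an assumption standing in for the empirical fit, since its value is not given in the abstract), the law reads $N^{*}_{\text{teacher}} \approx k \cdot N_{\text{student}}$, where $N^{*}_{\text{teacher}}$ is the teacher scale yielding the best-performing student, $N_{\text{student}}$ is the student scale, and $k$ is a constant estimated from the small-scale (<3B) distillation runs and then extrapolated to the 7B setting.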
Related papers
- Distillation Scaling Laws [9.828322497230053]
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student.
arXiv Detail & Related papers (2025-02-12T17:52:47Z) - Who Taught You That? Tracing Teachers in Model Distillation [23.566776089005963]
We ask: Can we identify a student's teacher based on its outputs?
We consider practical task distillation targets including summarization, question answering, and instruction-following.
We design discriminative models that operate over lexical features.
arXiv Detail & Related papers (2025-02-10T16:48:56Z) - On Teacher Hacking in Language Model Distillation [61.19867259475047]
We investigate whether a similar phenomenon, that we call teacher hacking, can occur during knowledge distillation.
This could arise because the teacher LM is itself an imperfect approximation of the true distribution.
Online data generation techniques effectively mitigate teacher hacking.
arXiv Detail & Related papers (2025-02-04T19:26:28Z) - Pre-training Distillation for Large Language Models: A Design Space Exploration [54.67324039434781]
Pre-training distillation aims to transfer knowledge from a large teacher model to a smaller student model.
We conduct experiments to explore the design space of pre-training distillation and find better configurations.
We hope our exploration of the design space will inform future practices in pre-training distillation.
arXiv Detail & Related papers (2024-10-21T17:16:13Z) - LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
Evaluations across different benchmarks show that, with the proper strategies, even a 2.7B small-scale model can perform on par with larger models of 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z) - Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
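As a minimal sketch of the token-level objective (the paper's exact distribution-adaptive clipping rule is not given in this summary, so the fixed threshold below is only a placeholder assumption):

```python
# Hypothetical sketch of a token-level KL distillation loss with probability
# clipping; the fixed threshold `eps` stands in for the paper's unspecified
# distribution-adaptive clipping rule.
import torch
import torch.nn.functional as F

def clipped_token_kl(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     eps: float = 1e-3,
                     temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) per token, with tiny teacher probabilities clipped."""
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    p_t = p_t.clamp(min=eps)                       # clip very small probabilities
    p_t = p_t / p_t.sum(dim=-1, keepdim=True)      # renormalize after clipping
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    kl = (p_t * (p_t.log() - log_p_s)).sum(dim=-1)  # per-token KL over the vocabulary
    return kl.mean()                                # average over batch and sequence
```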
arXiv Detail & Related papers (2024-07-14T03:51:49Z) - PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs [47.35598271306371]
Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings.
Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models.
We present PLaD, a novel preference-based LLM distillation framework.
arXiv Detail & Related papers (2024-06-05T03:08:25Z) - Beyond Answers: Transferring Reasoning Capabilities to Smaller LLMs Using Multi-Teacher Knowledge Distillation [23.736611338497244]
TinyLLM is a new knowledge distillation paradigm to learn a small student LLM from multiple large teacher LLMs.
We introduce an in-context example generator and a teacher-forcing Chain-of-Thought strategy to ensure that the rationales are accurate and grounded in contextually appropriate scenarios.
Results show that TinyLLM can outperform large teacher LLMs significantly, despite a considerably smaller model size.
arXiv Detail & Related papers (2024-02-07T06:48:24Z) - Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers [73.28459749681879]
This paper focuses on LLaMA, a prominent open-source foundational model in natural language processing.
Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding.
We unveil several key and uncommon findings based on the designed probing tasks.
arXiv Detail & Related papers (2023-12-07T14:50:41Z) - Triplet Knowledge Distillation [73.39109022280878]
In knowledge distillation, the teacher is generally much larger than the student, so the teacher's solution is often difficult for the student to learn.
To ease the mimicking difficulty, we introduce a triplet knowledge distillation mechanism named TriKD.
arXiv Detail & Related papers (2023-05-25T12:12:31Z) - Lifting the Curse of Capacity Gap in Distilling Language Models [19.370268407987652]
We propose a mixture of minimal experts (MiniMoE) which imposes extra parameters to the student but introduces almost no additional inference compute.
With a compression rate as high as $\sim$50$\times$, MiniMoE preserves $\sim$95% of the teacher's GLUE score.
arXiv Detail & Related papers (2023-05-20T07:30:55Z) - One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers [54.146208195806636]
We propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression.
We show that MT-BERT can train a high-quality student model from multiple teacher PLMs.
Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
arXiv Detail & Related papers (2021-06-02T08:42:33Z) - Reducing the Teacher-Student Gap via Spherical Knowledge Disitllation [67.75526580926149]
Knowledge distillation aims at obtaining a compact and effective model by learning the mapping function from a much larger one.
We investigate the capacity gap problem by studying the confidence gap between the teacher and the student.
We find that the magnitude of confidence is not necessary for knowledge distillation and can harm student performance if the student is forced to learn it.
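Consistent with that finding (and not necessarily the paper's exact spherical formulation), a minimal sketch is to normalize the logits of both models before the softened KL, so the student matches the shape of the teacher's distribution rather than its confidence magnitude:

```python
# Hypothetical sketch: remove confidence magnitude from the distillation target
# by L2-normalizing logits before the softened KL. This is an assumed
# simplification, not necessarily the paper's spherical formulation.
import torch
import torch.nn.functional as F

def normalized_kd_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       temperature: float = 2.0) -> torch.Tensor:
    s = F.normalize(student_logits, dim=-1)   # unit-norm student logits
    t = F.normalize(teacher_logits, dim=-1)   # unit-norm teacher logits
    log_p_s = F.log_softmax(s / temperature, dim=-1)
    p_t = F.softmax(t / temperature, dim=-1)
    # Softened KL(teacher || student) on the normalized logits.
    return F.kl_div(log_p_s, p_t, reduction="batchmean")
```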
arXiv Detail & Related papers (2020-10-15T03:03:36Z) - Contrastive Distillation on Intermediate Representations for Language Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
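A minimal sketch of such a contrastive objective over intermediate representations is given below; the linear projection and the use of in-batch negatives are assumptions, since the summary does not spell out CoDIR's exact construction:

```python
# Hypothetical InfoNCE-style contrastive loss between student and teacher
# hidden states; the projection head and in-batch negatives are assumptions.
import torch
import torch.nn.functional as F

def contrastive_hidden_loss(student_hidden: torch.Tensor,  # (batch, d_student)
                            teacher_hidden: torch.Tensor,  # (batch, d_teacher)
                            proj: torch.nn.Linear,         # maps d_student -> d_teacher
                            temperature: float = 0.1) -> torch.Tensor:
    s = F.normalize(proj(student_hidden), dim=-1)
    t = F.normalize(teacher_hidden, dim=-1)
    # Each student representation's matching teacher representation (the diagonal)
    # is the positive; all other teacher representations in the batch are negatives.
    logits = s @ t.t() / temperature
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)
```

For example, with hidden sizes 768 (student) and 1024 (teacher), `proj = torch.nn.Linear(768, 1024)` would be created once and trained jointly with the student.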
arXiv Detail & Related papers (2020-09-29T17:31:43Z) - Subclass Distillation [94.18870689772544]
We show that it is possible to transfer most of the generalization ability of a teacher to a student.
For datasets with known, natural subclasses, we demonstrate that the teacher learns similar subclasses.
For clickthrough datasets where the subclasses are unknown we demonstrate that subclass distillation allows the student to learn faster and better.
arXiv Detail & Related papers (2020-02-10T16:45:30Z)