Towards the Law of Capacity Gap in Distilling Language Models
- URL: http://arxiv.org/abs/2311.07052v3
- Date: Thu, 25 Jul 2024 03:20:15 GMT
- Title: Towards the Law of Capacity Gap in Distilling Language Models
- Authors: Chen Zhang, Dawei Song, Zheyu Ye, Yan Gao
- Abstract summary: Language model (LM) distillation is a trending area that aims to distil the knowledge residing in a large teacher LM into a small student one.
MiniMA is demonstrated to outperform a wide range of 3B competitors and could even compete with several 7B models.
- Score: 13.630180187069904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language model (LM) distillation is a trending area that aims to distil the knowledge residing in a large teacher LM into a small student one. While various methods have been proposed to maximize the effectiveness of the distillation, significant challenges persist, particularly when there is a substantial capacity gap between the teacher and student LMs. This issue, often referred to as the curse of capacity gap, means that a larger teacher does not necessarily yield a better student than one distilled from a smaller teacher. In other words, there is likely an optimal teacher that yields the best student along the scaling course of the teacher. However, as indicated in previous studies, the curse of capacity gap cannot be tackled without notable compute overhead. In the context of large LMs (LLMs), previously viable approaches become much less meaningful, as it is an impossible triangle to distill an expected student from an optimal teacher with small compute overhead. Fortunately, the impossible triangle becomes possible once a law of capacity gap is induced. In this paper, we take the spirit of scaling laws and reveal that the optimal teacher scale almost consistently follows a linear scaling with the student scale across different model architectures and data scales. The law then guides us to distil a 3B student LM (termed MiniMA) from LLaMA2-7B. MiniMA is demonstrated to outperform a wide range of 3B competitors and could even compete with several 7B models.
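To make the claimed law concrete, the following minimal sketch shows what a linear scaling rule between student and teacher sizes would look like in code. The function name, slope, and intercept are illustrative assumptions, not the coefficients fitted in the paper.

```python
def suggested_teacher_scale(student_params_b: float,
                            slope: float = 2.3,
                            intercept_b: float = 0.0) -> float:
    """Suggest a teacher size (in billions of parameters) under a linear
    law of the form teacher ~= slope * student + intercept.

    The slope and intercept are illustrative placeholders, not the values
    fitted in the paper.
    """
    return slope * student_params_b + intercept_b


# Example: with an assumed slope of ~2.3, a 3B student maps to a ~7B teacher,
# which is consistent with MiniMA-3B being distilled from LLaMA2-7B.
print(round(suggested_teacher_scale(3.0), 1))  # 6.9
```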
Related papers
- Distillation Scaling Laws [9.828322497230053]
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher.
Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student.
arXiv Detail & Related papers (2025-02-12T17:52:47Z) - Who Taught You That? Tracing Teachers in Model Distillation [23.566776089005963]
We ask: Can we identify a student's teacher based on its outputs?
We consider practical task distillation targets including summarization, question answering, and instruction-following.
We design discriminative models that operate over lexical features.
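As a rough illustration of such a lexical-feature discriminator, the sketch below fits a TF-IDF plus logistic-regression classifier to guess which teacher a student's output came from. The toy data, labels, and model choice are assumptions for illustration, not the paper's setup; scikit-learn is assumed to be available.

```python
# Hypothetical teacher-attribution classifier over lexical features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in data: outputs produced by students distilled from different teachers.
student_outputs = [
    "The article discusses rising sea levels and coastal risk.",
    "Sea levels are rising; coastal cities face growing risk.",
]
teacher_labels = ["teacher_A", "teacher_B"]  # which teacher each student was distilled from

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # word and bigram lexical features
    LogisticRegression(max_iter=1000),
)
clf.fit(student_outputs, teacher_labels)

print(clf.predict(["Coastal cities face risk as sea levels rise."]))
```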
arXiv Detail & Related papers (2025-02-10T16:48:56Z) - On Teacher Hacking in Language Model Distillation [61.19867259475047]
We investigate whether a similar phenomenon, that we call teacher hacking, can occur during knowledge distillation.
This could arise because the teacher LM is itself an imperfect approximation of the true distribution.
Online data generation techniques effectively mitigate teacher hacking.
arXiv Detail & Related papers (2025-02-04T19:26:28Z) - Pre-training Distillation for Large Language Models: A Design Space Exploration [54.67324039434781]
Pre-training distillation aims to transfer knowledge from a large teacher model to a smaller student model.
We conduct experiments to explore the design space of pre-training distillation and find better configurations.
We hope our exploration of the design space will inform future practices in pre-training distillation.
arXiv Detail & Related papers (2024-10-21T17:16:13Z) - LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
With proper strategies and evaluation across different benchmarks, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z) - Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
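As a rough sketch of the token-level component, the snippet below implements a KL distillation loss in which extreme teacher-student log-probability ratios are clipped. The fixed clipping threshold is a simplification; the paper's distribution-adaptive clipping rule is not reproduced here, and PyTorch is assumed.

```python
import torch.nn.functional as F


def clipped_kl_distill_loss(student_logits, teacher_logits, clip=5.0, tau=1.0):
    """Token-level KL(teacher || student) with a clipped log-probability ratio.

    student_logits, teacher_logits: (batch, seq_len, vocab) tensors.
    The fixed `clip` threshold is a simplification of the paper's
    distribution-adaptive clipping.
    """
    t_logp = F.log_softmax(teacher_logits.detach() / tau, dim=-1)
    s_logp = F.log_softmax(student_logits / tau, dim=-1)
    log_ratio = (t_logp - s_logp).clamp(min=-clip, max=clip)  # clip extreme ratios
    kl = (t_logp.exp() * log_ratio).sum(dim=-1)               # per-token KL
    return kl.mean()


# Usage (hypothetical): loss = clipped_kl_distill_loss(student(x).logits, teacher(x).logits)
```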
arXiv Detail & Related papers (2024-07-14T03:51:49Z) - PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs [47.35598271306371]
Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings.
Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models.
We present PLaD, a novel preference-based LLM distillation framework.
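A loose sketch of the pseudo-preference idea follows: the teacher's generation is treated as preferred over the student's own generation, and a pairwise logistic ranking loss pushes the student to score it higher. The scoring choice and the loss form are assumptions for illustration, not the exact PLaD objective; PyTorch is assumed.

```python
import torch.nn.functional as F


def pairwise_preference_loss(preferred_scores, dispreferred_scores, margin=0.0):
    """Logistic ranking loss over pseudo-preference pairs.

    preferred_scores / dispreferred_scores: (batch,) scalar scores per example,
    e.g. the student's length-normalized sequence log-likelihood of the teacher
    generation and of its own generation (an assumption here).
    """
    return F.softplus(-(preferred_scores - dispreferred_scores - margin)).mean()


# Usage (hypothetical score tensors of shape (batch,)):
# loss = pairwise_preference_loss(student_loglik_of_teacher_gen,
#                                 student_loglik_of_student_gen)
```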
arXiv Detail & Related papers (2024-06-05T03:08:25Z) - Beyond Answers: Transferring Reasoning Capabilities to Smaller LLMs Using Multi-Teacher Knowledge Distillation [23.736611338497244]
TinyLLM is a new knowledge distillation paradigm to learn a small student LLM from multiple large teacher LLMs.
We introduce an in-context example generator and a teacher-forcing Chain-of-Thought strategy to ensure that the rationales are accurate and grounded in contextually appropriate scenarios.
Results show that TinyLLM can outperform large teacher LLMs significantly, despite a considerably smaller model size.
arXiv Detail & Related papers (2024-02-07T06:48:24Z) - Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers [73.28459749681879]
This paper focuses on LLaMA, a prominent open-source foundational model in natural language processing.
Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding.
We unveil several key and uncommon findings based on the designed probing tasks.
arXiv Detail & Related papers (2023-12-07T14:50:41Z) - Triplet Knowledge Distillation [73.39109022280878]
In knowledge distillation, the teacher is generally much larger than the student, so the teacher's solution is often difficult for the student to learn.
To ease the mimicking difficulty, we introduce a triplet knowledge distillation mechanism named TriKD.
arXiv Detail & Related papers (2023-05-25T12:12:31Z) - Lifting the Curse of Capacity Gap in Distilling Language Models [19.370268407987652]
We propose a mixture of minimal experts (MiniMoE), which adds extra parameters to the student but introduces almost no additional inference compute.
With a compression rate of up to ~50x, MiniMoE preserves ~95% of the teacher's GLUE score.
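To illustrate how extra parameters can come with almost no extra inference compute, here is a toy top-1 mixture-of-experts feed-forward block: parameters scale with the number of experts, but each token activates only one expert. This is an illustrative sketch, not the MiniMoE architecture from the paper; PyTorch is assumed.

```python
import torch
import torch.nn as nn


class ToyTop1MoEFFN(nn.Module):
    """Toy top-1 mixture-of-experts feed-forward block (not the paper's MiniMoE).

    Parameters grow with num_experts, but each token routes through a single
    expert, so per-token inference compute stays close to one FFN.
    """

    def __init__(self, dim: int, hidden: int, num_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, dim)
        probs = self.router(x).softmax(dim=-1)
        top1 = probs.argmax(dim=-1)            # one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                out[mask] = expert(x[mask]) * probs[mask, i:i + 1]
        return out


# Usage: layer = ToyTop1MoEFFN(dim=256, hidden=1024); y = layer(torch.randn(8, 256))
```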
arXiv Detail & Related papers (2023-05-20T07:30:55Z) - One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers [54.146208195806636]
We propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression.
We show that MT-BERT can train a high-quality student model from multiple teacher PLMs.
Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
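A bare-bones sketch of distillation from multiple teachers in this spirit: the student matches the average of the teachers' softened output distributions. Equal teacher weighting is an assumption, and MT-BERT's full framework (including supervised task losses) is not reproduced; PyTorch is assumed.

```python
import torch
import torch.nn.functional as F


def multi_teacher_kd_loss(student_logits, teacher_logits_list, tau=2.0):
    """KL divergence between the student and the mean of several teachers'
    temperature-softened distributions.

    student_logits: (batch, num_classes); teacher_logits_list: list of tensors
    of the same shape. Equal teacher weights are an assumption.
    """
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(t / tau, dim=-1) for t in teacher_logits_list]
        ).mean(dim=0)
    student_logp = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * tau * tau
```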
arXiv Detail & Related papers (2021-06-02T08:42:33Z) - Reducing the Teacher-Student Gap via Spherical Knowledge Distillation [67.75526580926149]
Knowledge distillation aims at obtaining a compact and effective model by learning the mapping function from a much larger one.
We investigate the capacity gap problem by studying the confidence gap between the teacher and the student.
We find that the magnitude of confidence is not necessary for knowledge distillation and can harm student performance if the student is forced to learn it.
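A small sketch of that intuition: factor out the confidence magnitude by normalizing centered logits before matching teacher and student. This is a simplified reading for illustration, not the paper's exact spherical formulation; PyTorch is assumed.

```python
import torch.nn.functional as F


def normalized_logit_kd_loss(student_logits, teacher_logits, tau=1.0):
    """Distillation loss computed on logits whose magnitude has been removed.

    Logits are centered and L2-normalized so only their 'direction' is matched,
    dropping the confidence magnitude. A simplified sketch, not the exact
    spherical knowledge distillation objective.
    """
    def sphere(z):
        z = z - z.mean(dim=-1, keepdim=True)      # remove per-example shift
        return z / z.norm(dim=-1, keepdim=True)   # remove magnitude

    t_probs = F.softmax(sphere(teacher_logits.detach()) / tau, dim=-1)
    s_logp = F.log_softmax(sphere(student_logits) / tau, dim=-1)
    return F.kl_div(s_logp, t_probs, reduction="batchmean")
```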
arXiv Detail & Related papers (2020-10-15T03:03:36Z) - Contrastive Distillation on Intermediate Representations for Language Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
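A compact sketch of a contrastive objective over intermediate representations: an InfoNCE-style loss in which the matching teacher state is the positive and the other in-batch teacher states serve as negatives. CoDIR's large negative sample set and exact training setup are omitted; PyTorch is assumed.

```python
import torch
import torch.nn.functional as F


def contrastive_hidden_loss(student_hidden, teacher_hidden, tau=0.07):
    """InfoNCE over pooled intermediate representations.

    student_hidden, teacher_hidden: (batch, dim) pooled hidden states from
    corresponding layers. Only in-batch negatives are used here; CoDIR draws
    negatives from a much larger set.
    """
    s = F.normalize(student_hidden, dim=-1)
    t = F.normalize(teacher_hidden, dim=-1)
    logits = s @ t.t() / tau                                 # (batch, batch) similarities
    targets = torch.arange(s.size(0), device=s.device)       # positive = matching index
    return F.cross_entropy(logits, targets)
```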
arXiv Detail & Related papers (2020-09-29T17:31:43Z) - Subclass Distillation [94.18870689772544]
We show that it is possible to transfer most of the generalization ability of a teacher to a student.
For datasets with known, natural subclasses, we demonstrate that the teacher learns similar subclasses.
For clickthrough datasets where the subclasses are unknown, we demonstrate that subclass distillation allows the student to learn faster and better.
arXiv Detail & Related papers (2020-02-10T16:45:30Z)