Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation
- URL: http://arxiv.org/abs/2406.13114v1
- Date: Wed, 19 Jun 2024 00:01:14 GMT
- Title: Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation
- Authors: Yuhang Zhou, Jing Zhu, Paiheng Xu, Xiaoyu Liu, Xiyao Wang, Danai Koutra, Wei Ai, Furong Huang,
- Abstract summary: Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to more compact student models.
We introduce the Multi-Stage Balanced Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget.
BalDistill achieves state-of-the-art performance across diverse long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.
- Score: 33.21314371624318
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have significantly advanced various natural language processing tasks, but deploying them remains computationally expensive. Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to more compact student models. Particularly, sequence-level KD, which distills rationale-based reasoning processes instead of merely final outcomes, shows great potential in enhancing students' reasoning capabilities. However, current methods struggle with sequence level KD under long-tailed data distributions, adversely affecting generalization on sparsely represented domains. We introduce the Multi-Stage Balanced Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget. By dynamically selecting representative head domain examples and synthesizing tail domain examples, BalDistill achieves state-of-the-art performance across diverse long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.
Related papers
- DDK: Distilling Domain Knowledge for Efficient Large Language Models [40.839056203329136]
Knowledge Distillation (KD) has emerged as an effective strategy to improve the performance of a smaller language model.
This paper introduces DDK, which adjusts the composition of the distillation dataset according to the domain performance differences between the teacher and student models.
Extensive evaluations show that DDK significantly improves the performance of student models, outperforming both continuously pretrained baselines and existing knowledge distillation methods by a large margin.
arXiv Detail & Related papers (2024-07-23T03:47:28Z) - Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z) - Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs)
We re-formulate KD of LLMs into two stages: first optimizing and objective consisting of implicit reward and reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z) - Densely Distilling Cumulative Knowledge for Continual Learning [14.343655566551213]
Continual learning, involving sequential training on diverse tasks, often faces catastrophic forgetting.
We propose Dense Knowledge Distillation (DKD) to distill the cumulative knowledge of all the previous tasks.
Our DKD outperforms recent state-of-the-art baselines across diverse benchmarks and scenarios.
arXiv Detail & Related papers (2024-05-16T05:37:06Z) - Improving In-context Learning via Bidirectional Alignment [41.214003703218914]
Large language models (LLMs) have shown impressive few-shot generalization on many tasks via in-context learning (ICL)
We propose Bidirectional Alignment (BiAlign) to fully leverage the models' preferences for ICL examples to improve the ICL abilities of student models.
Specifically, we introduce the alignment of input preferences between student and teacher models by incorporating a novel ranking loss.
arXiv Detail & Related papers (2023-12-28T15:02:03Z) - Dynamic Sub-graph Distillation for Robust Semi-supervised Continual
Learning [52.046037471678005]
We focus on semi-supervised continual learning (SSCL), where the model progressively learns from partially labeled data with unknown categories.
We propose a novel approach called Dynamic Sub-Graph Distillation (DSGD) for semi-supervised continual learning.
arXiv Detail & Related papers (2023-12-27T04:40:12Z) - Robustness-Reinforced Knowledge Distillation with Correlation Distance
and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z) - BOOT: Data-free Distillation of Denoising Diffusion Models with
Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT, that overcomes limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z) - Optimal Decision Diagrams for Classification [68.72078059880018]
We study the training of optimal decision diagrams from a mathematical programming perspective.
We introduce a novel mixed-integer linear programming model for training.
We show how this model can be easily extended for fairness, parsimony, and stability notions.
arXiv Detail & Related papers (2022-05-28T18:31:23Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.