f-Divergence Minimization for Sequence-Level Knowledge Distillation
- URL: http://arxiv.org/abs/2307.15190v1
- Date: Thu, 27 Jul 2023 20:39:06 GMT
- Title: f-Divergence Minimization for Sequence-Level Knowledge Distillation
- Authors: Yuqiao Wen, Zichao Li, Wenyu Du, Lili Mou
- Abstract summary: Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one.
We propose an f-DISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function.
Experiments across four datasets show that our methods outperform existing KD approaches.
- Score: 23.513372304624486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) is the process of transferring knowledge from a
large model to a small one. It has gained increasing attention in the natural
language processing community, driven by the demands of compressing
ever-growing language models. In this work, we propose an f-DISTILL framework,
which formulates sequence-level knowledge distillation as minimizing a
generalized f-divergence function. We propose four distilling variants under
our framework and show that existing SeqKD and ENGINE approaches are
approximations of our f-DISTILL methods. We further derive step-wise
decomposition for our f-DISTILL, reducing intractable sequence-level divergence
to word-level losses that can be computed in a tractable manner. Experiments
across four datasets show that our methods outperform existing KD approaches,
and that our symmetric distilling losses can better force the student to learn
from the teacher distribution.
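The step-wise decomposition idea can be sketched in a few lines: instead of comparing whole output sequences, a word-level divergence between the teacher's and student's next-token distributions is summed over decoding steps. The sketch below is a minimal illustration under that reading, not the paper's implementation; the toy distributions and function names are invented for the example.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions over the same vocabulary.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded f-divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def tvd(p, q):
    """Total variation distance: half the L1 distance, also symmetric."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def stepwise_distill_loss(teacher_steps, student_steps, divergence):
    """Sum a word-level divergence over decoding steps, as a tractable
    stand-in for the intractable sequence-level divergence."""
    return sum(divergence(p, q) for p, q in zip(teacher_steps, student_steps))

# Toy example: a 2-step sequence over a 3-word vocabulary.
teacher = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
student = [[0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]

forward = stepwise_distill_loss(teacher, student, kl)                     # KL(p || q)
reverse = stepwise_distill_loss(teacher, student, lambda p, q: kl(q, p))  # KL(q || p)
sym_js  = stepwise_distill_loss(teacher, student, js)
sym_tvd = stepwise_distill_loss(teacher, student, tvd)
```

The four choices of divergence mirror the flavor of the paper's four variants: the forward and reverse KL are asymmetric (mean-seeking vs. mode-seeking), while JS and TVD are the symmetric losses the abstract credits with forcing the student closer to the full teacher distribution.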
Related papers
- Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation [50.19746127327559]
We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions.
Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models.
arXiv Detail & Related papers (2026-02-24T11:54:06Z)
- On-Policy Context Distillation for Language Models [92.82835176360864]
We propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation.
We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation and system prompt distillation.
arXiv Detail & Related papers (2026-02-12T18:58:28Z)
- CD^2: Constrained Dataset Distillation for Few-Shot Class-Incremental Learning [24.299542011394298]
Few-shot class-incremental learning (FSCIL) has received significant attention.
We propose a framework termed Constrained Dataset Distillation (CD^2) to facilitate FSCIL.
arXiv Detail & Related papers (2026-01-13T13:01:14Z)
- Knowledge Distillation of Uncertainty using Deep Latent Factor Model [10.148306002388196]
We introduce a new method of distribution distillation called Gaussian distillation.
It estimates the distribution of a teacher ensemble through a special Gaussian process called the deep latent factor model (DLF).
Using multiple benchmark datasets, we demonstrate that the proposed Gaussian distillation outperforms existing baselines.
arXiv Detail & Related papers (2025-10-22T06:46:59Z)
- Knowledge distillation through geometry-aware representational alignment [3.901188865224763]
We show that existing feature distillation methods cannot capture the feature structure, even under zero loss.
We then motivate the use of the Procrustes distance and the Frobenius norm of the feature Gram matrix, distances already common in measuring representational alignment.
We show that feature distillation through our method yields statistically significant improvements in distillation performance across language model families.
arXiv Detail & Related papers (2025-09-27T09:59:46Z)
- Train with Perturbation, Infer after Merging: A Two-Stage Framework for Continual Learning [57.514786046966265]
We propose Perturb-and-Merge (P&M), a novel continual learning framework that integrates model merging into the CL paradigm to mitigate forgetting.
Our proposed approach achieves state-of-the-art performance on several continual learning benchmark datasets.
arXiv Detail & Related papers (2025-05-28T14:14:19Z)
- Distillation-Free One-Step Diffusion for Real-World Image Super-Resolution [81.81748032199813]
We propose a Distillation-Free One-Step Diffusion model.
Specifically, we propose a noise-aware discriminator (NAD) to participate in adversarial training.
We improve the perceptual loss with edge-aware DISTS (EA-DISTS) to enhance the model's ability to generate fine details.
arXiv Detail & Related papers (2024-10-05T16:41:36Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Teaching with Uncertainty: Unleashing the Potential of Knowledge Distillation in Object Detection [47.0507287491627]
We propose a novel feature-based distillation paradigm with knowledge uncertainty for object detection.
By leveraging the Monte Carlo dropout technique, we introduce knowledge uncertainty into the training process of the student model.
Our method performs effectively during the KD process without requiring intricate structures or extensive computational resources.
arXiv Detail & Related papers (2024-06-11T06:51:02Z)
- Regularized DeepIV with Model Selection [72.17508967124081]
Regularized DeepIV (RDIV) regression can converge to the least-norm IV solution.
Our method matches the current state-of-the-art convergence rate.
arXiv Detail & Related papers (2024-03-07T05:38:56Z)
- Learning to Maximize Mutual Information for Chain-of-Thought Distillation [13.660167848386806]
Distilling Step-by-Step (DSS) has demonstrated promise by imbuing smaller models with the superior reasoning capabilities of their larger counterparts.
However, DSS overlooks the intrinsic relationship between the two training tasks, leading to ineffective integration of CoT knowledge with the task of label prediction.
We propose a variational approach to solve this problem using a learning-based method.
arXiv Detail & Related papers (2024-03-05T22:21:45Z)
- Dynamic Sub-graph Distillation for Robust Semi-supervised Continual Learning [52.046037471678005]
We focus on semi-supervised continual learning (SSCL), where the model progressively learns from partially labeled data with unknown categories.
We propose a novel approach called Dynamic Sub-Graph Distillation (DSGD) for semi-supervised continual learning.
arXiv Detail & Related papers (2023-12-27T04:40:12Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of ''student'' models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z)
- Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method dubbed DiffKD, to explicitly denoise and match features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z)
- Class-aware Information for Logit-based Knowledge Distillation [16.634819319915923]
We propose a Class-aware Logit Knowledge Distillation (CLKD) method that extends logit distillation to both the instance level and the class level.
CLKD enables the student model to mimic higher-level semantic information from the teacher model, hence improving distillation performance.
arXiv Detail & Related papers (2022-11-27T09:27:50Z)
- Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant model (A).
In this way, S is trained to mimic the feature maps of T, and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.