Multi-Granularity Semantic Revision for Large Language Model Distillation
- URL: http://arxiv.org/abs/2407.10068v1
- Date: Sun, 14 Jul 2024 03:51:49 GMT
- Title: Multi-Granularity Semantic Revision for Large Language Model Distillation
- Authors: Xiaoyu Liu, Yun Zhang, Wei Li, Simiao Li, Xudong Huang, Hanting Chen, Yehui Tang, Jie Hu, Zhiwei Xiong, Yunhe Wang
- Abstract summary: We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute probability correlations within spans, and constrain the teacher's and student's probability correlations to be consistent.
- Score: 66.03746866578274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation plays a key role in compressing Large Language Models (LLMs): a small student model is trained under the guidance of a large teacher model. However, existing LLM distillation methods rely heavily on student-generated outputs, which may introduce generation errors and misguide the distillation process. Moreover, the distillation loss functions proposed in prior work struggle to align the most informative parts of the distributions, owing to the complexity of LLMs' output distributions. To address these problems, we propose a multi-granularity semantic revision method for LLM distillation. At the sequence level, we propose a sequence correction and re-generation (SCRG) strategy. SCRG first computes the semantic cognitive difference between the teacher and the student to detect an erroneous token, corrects it with the teacher-generated token, and then re-generates the sequence, reducing generation errors and enhancing generation diversity. At the token level, we design a distribution adaptive clipping Kullback-Leibler (DAC-KL) loss as the distillation objective. DAC-KL uses a learnable sub-network to adaptively extract semantically dense regions from the teacher's output distribution, avoiding interference from redundant information during distillation. Finally, at the span level, we leverage the span priors of a sequence to compute probability correlations within spans and constrain the teacher's and student's probability correlations to be consistent, further improving the transfer of semantic information. Extensive experiments across different model families, with parameters ranging from 0.1B to 13B, demonstrate the superiority of our method over existing approaches.
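The abstract describes SCRG only at a high level, so the following is a minimal PyTorch sketch of how such a correction-and-regeneration loop could look, assuming HuggingFace-style causal LMs whose forward pass returns `.logits`. The KL divergence between the teacher's and student's next-token distributions stands in for the paper's "semantic cognitive difference"; the function name `scrg_step` and the `divergence_threshold` parameter are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def scrg_step(student, teacher, input_ids, max_new_tokens=64, divergence_threshold=1.0):
    """Sketch of a sequence correction and re-generation (SCRG) step.

    The student decodes greedily; at each position we compare the student's and
    teacher's next-token distributions. If their KL divergence exceeds a
    threshold, the student's token is treated as an error, replaced by the
    teacher's token, and decoding continues from the corrected prefix.
    (Sampling instead of argmax could be used to further increase diversity.)
    """
    seq = input_ids
    for _ in range(max_new_tokens):
        s_logits = student(seq).logits[:, -1, :]   # student next-token logits [B, V]
        t_logits = teacher(seq).logits[:, -1, :]   # teacher next-token logits [B, V]
        s_log_probs = F.log_softmax(s_logits, dim=-1)
        t_probs = F.softmax(t_logits, dim=-1)
        # stand-in for the "semantic cognitive difference": KL(teacher || student)
        divergence = (t_probs * (t_probs.clamp_min(1e-9).log() - s_log_probs)).sum(-1)

        s_token = s_logits.argmax(-1, keepdim=True)
        t_token = t_logits.argmax(-1, keepdim=True)
        # replace the student's token with the teacher's where they diverge too much
        next_token = torch.where(divergence.unsqueeze(-1) > divergence_threshold,
                                 t_token, s_token)
        seq = torch.cat([seq, next_token], dim=-1)
    return seq
```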
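The DAC-KL loss is likewise only described at a high level. The sketch below is one plausible reading: a small learnable gate network looks at summary statistics of the teacher's per-position distribution and predicts a clipping threshold, and the KL term is computed only over the high-probability ("semantically dense") region it selects. The class name `DACKLSketch`, the choice of statistics, and the thresholding scheme are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DACKLSketch(nn.Module):
    """Rough sketch of a distribution-adaptive clipping KL (DAC-KL) style loss.

    A small learnable sub-network reads summary statistics of the teacher's
    next-token distribution and predicts a probability threshold per position.
    Only tokens whose teacher probability exceeds that threshold contribute to
    the KL term; the remaining low-probability mass is treated as redundant.
    """

    def __init__(self, num_stats: int = 4, hidden: int = 32):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(num_stats, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # threshold fraction in (0, 1)
        )

    def forward(self, teacher_logits, student_logits):
        t_probs = F.softmax(teacher_logits, dim=-1)   # [B, T, V]
        s_logp = F.log_softmax(student_logits, dim=-1)

        # simple per-position statistics of the teacher distribution
        stats = torch.stack([
            t_probs.max(dim=-1).values,
            t_probs.topk(5, dim=-1).values.sum(dim=-1),
            -(t_probs * t_probs.clamp_min(1e-9).log()).sum(dim=-1),  # entropy
            t_probs.var(dim=-1),
        ], dim=-1)                                    # [B, T, 4]

        # adaptive clipping threshold, scaled by the per-position peak probability
        threshold = self.gate(stats) * t_probs.max(dim=-1, keepdim=True).values
        # hard mask kept for clarity; a trainable gate would need a soft/relaxed mask
        mask = (t_probs >= threshold).float()

        kl = t_probs * (t_probs.clamp_min(1e-9).log() - s_logp)
        return (kl * mask).sum(dim=-1).mean()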
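The abstract does not specify how probability correlations within a span are computed. As one illustrative reading, the sketch below gathers each model's probability of the target token at every position, forms a normalized outer-product correlation matrix per span, and penalizes the squared difference between the teacher's and student's matrices. The span boundaries are taken as given, standing in for the paper's span priors; all names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def span_correlation_loss(teacher_logits, student_logits, target_ids, spans):
    """Sketch of a span-level consistency term.

    `spans` is a list of (start, end) index pairs standing in for span priors.
    Within each span we compare the teacher's and student's correlation
    structure over the probabilities assigned to the target tokens.
    """
    t_probs = F.softmax(teacher_logits, dim=-1)
    s_probs = F.softmax(student_logits, dim=-1)
    # probability assigned to the realized target token at each position: [B, T]
    t_p = t_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    s_p = s_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)

    loss = teacher_logits.new_zeros(())
    for start, end in spans:
        t_span = F.normalize(t_p[:, start:end], dim=-1)       # [B, L]
        s_span = F.normalize(s_p[:, start:end], dim=-1)
        t_corr = t_span.unsqueeze(-1) @ t_span.unsqueeze(-2)   # [B, L, L]
        s_corr = s_span.unsqueeze(-1) @ s_span.unsqueeze(-2)
        loss = loss + F.mse_loss(s_corr, t_corr)
    return loss / max(len(spans), 1)
```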
Related papers
- Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion [29.297959023968165]
This paper proposes a progressive distillation method based on masked generation features for the KGC task.
Specifically, we perform pre-distillation on PLM to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models.
The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-01-19T07:34:36Z) - AICSD: Adaptive Inter-Class Similarity Distillation for Semantic Segmentation [12.92102548320001]
This paper proposes a novel knowledge distillation method called Inter-Class Similarity Distillation (ICSD).
The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs.
Experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-08-08T13:17:20Z) - f-Divergence Minimization for Sequence-Level Knowledge Distillation [23.513372304624486]
Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one.
We propose an f-DISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function.
Experiments across four datasets show that our methods outperform existing KD approaches.
arXiv Detail & Related papers (2023-07-27T20:39:06Z) - Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners [102.20090188997301]
We explore how to obtain a model that combines Contrastive Learning (CL) and Masked Image Modeling (MIM) strengths.
In order to better obtain both discrimination and diversity, we propose a simple but effective Hybrid Distillation strategy.
Experimental results show that Hybrid Distill achieves superior performance across different benchmarks.
arXiv Detail & Related papers (2023-06-28T02:19:35Z) - On-Policy Distillation of Language Models: Learning from Self-Generated
Mistakes [44.97759066341107]
Generalized Knowledge Distillation (GKD) trains the student on its self-generated output sequences by leveraging feedback from the teacher.
We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks.
arXiv Detail & Related papers (2023-06-23T17:56:26Z) - Knowledge Diffusion for Distillation [53.908314960324915]
The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of these methods is to discard the noisy information and distill the valuable information in the feature.
We propose a novel KD method dubbed DiffKD to explicitly denoise and match features using diffusion models.
arXiv Detail & Related papers (2023-05-25T04:49:34Z) - Mind the Gap in Distilling StyleGANs [100.58444291751015]
The StyleGAN family is one of the most popular Generative Adversarial Networks (GANs) for unconditional generation.
This paper provides a comprehensive study of distilling from the popular StyleGAN-like architecture.
arXiv Detail & Related papers (2022-08-18T14:18:29Z) - Anomaly Detection via Reverse Distillation from One-Class Embedding [2.715884199292287]
We propose a novel T-S model consisting of a teacher encoder and a student decoder.
Instead of receiving raw images directly, the student network takes the teacher model's one-class embedding as input.
In addition, we introduce a trainable one-class bottleneck embedding module in our T-S model.
arXiv Detail & Related papers (2022-01-26T01:48:37Z) - Deep Semi-supervised Knowledge Distillation for Overlapping Cervical
Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
arXiv Detail & Related papers (2020-07-21T13:27:09Z) - Knowledge distillation via adaptive instance normalization [52.91164959767517]
We propose a new knowledge distillation method based on transferring feature statistics from the teacher to the student.
Our method goes beyond the standard way of enforcing the mean and variance of the student to be similar to those of the teacher.
We show that our distillation method outperforms other state-of-the-art distillation methods over a large set of experimental settings.
arXiv Detail & Related papers (2020-03-09T17:50:12Z)