SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models
- URL: http://arxiv.org/abs/2410.19503v1
- Date: Fri, 25 Oct 2024 12:10:49 GMT
- Title: SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models
- Authors: Jahyun Koo, Yerin Hwang, Yongil Kim, Taegwan Kang, Hyunkyung Bae, Kyomin Jung
- Abstract summary: SWITCH (Studying WIth TeaCHer for Knowledge Distillation) is a novel approach that strategically incorporates the teacher model during the student's sequence generation.
We show that SWITCH surpasses traditional Knowledge Distillation methods, particularly excelling in the generation of long sequential data.
- Score: 16.060402139507644
- License:
- Abstract: Despite the success of Large Language Models (LLMs), they still face challenges related to high inference costs and memory requirements. To address these issues, Knowledge Distillation (KD) has emerged as a popular method for model compression, with student-generated outputs (SGOs) being particularly notable for reducing the mismatch between training and inference. However, SGOs often produce noisy and biased sequences, which can lead to misguidance from the teacher model, especially in long sequences. To mitigate these challenges, we propose SWITCH (Studying WIth TeaCHer for Knowledge Distillation), a novel approach that strategically incorporates the teacher model during the student's sequence generation. SWITCH identifies discrepancies between the token probabilities of the teacher and student models, allowing the teacher to intervene selectively, particularly in long sequences that are more prone to teacher misguidance. Extensive experimental results across three model families and five instruction-following datasets show that SWITCH surpasses traditional KD methods, particularly excelling in the generation of long sequential data.
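Since the abstract describes the core mechanism (the student generates its own sequence while the teacher intervenes selectively wherever the two models' token probabilities diverge), a minimal PyTorch sketch of such a decoding loop is given below. The probability-gap threshold rule, the function names, and the toy stand-in models are assumptions added for illustration only; the paper's exact intervention criterion and schedule may differ.
```python
import torch
import torch.nn.functional as F


def switch_generate(student_logits_fn, teacher_logits_fn, prompt_ids,
                    max_new_tokens=32, discrepancy_threshold=0.3):
    """Teacher-guided student decoding: a rough sketch of the SWITCH idea.

    The student proposes each next token; when the student assigns its
    sampled token a much higher probability than the teacher does (gap
    above `discrepancy_threshold`), the token is re-drawn from the
    teacher's distribution instead. This gap rule is an illustrative
    assumption, not the paper's exact criterion.
    """
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        s_probs = F.softmax(student_logits_fn(ids), dim=-1)   # student p(x_t | x_<t), shape (1, vocab)
        t_probs = F.softmax(teacher_logits_fn(ids), dim=-1)   # teacher p(x_t | x_<t), shape (1, vocab)
        token = torch.multinomial(s_probs, num_samples=1)     # student's proposed token, shape (1, 1)
        gap = (s_probs[0, token] - t_probs[0, token]).item()  # how strongly the teacher disagrees
        if gap > discrepancy_threshold:                       # selective teacher intervention
            token = torch.multinomial(t_probs, num_samples=1)
        ids = torch.cat([ids, token], dim=-1)
    return ids


# Toy usage with random stand-in "models" (real use would call LM forward passes).
if __name__ == "__main__":
    vocab_size = 100
    student = lambda ids: torch.randn(1, vocab_size)
    teacher = lambda ids: torch.randn(1, vocab_size)
    out = switch_generate(student, teacher, torch.zeros(1, 1, dtype=torch.long))
    print(out.shape)  # torch.Size([1, 33])
```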
Related papers
- CFTS-GAN: Continual Few-Shot Teacher Student for Generative Adversarial Networks [0.5024983453990064]
Few-shot and continual learning face two well-known challenges in GANs: overfitting and catastrophic forgetting.
This paper proposes a Continual Few-shot Teacher-Student technique for the generative adversarial network (CFTS-GAN) that considers both challenges together.
arXiv Detail & Related papers (2024-10-17T20:49:08Z)
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- On the Resurgence of Recurrent Models for Long Sequences -- Survey and Research Opportunities in the Transformer Era [59.279784235147254]
This survey is aimed at providing an overview of these trends framed under the unifying umbrella of Recurrence.
It emphasizes novel research opportunities that become prominent when abandoning the idea of processing long sequences.
arXiv Detail & Related papers (2024-02-12T23:55:55Z)
- Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation [60.657065293413716]
We study semi-supervised sequence generation tasks, where labeled examples are too scarce to fine-tune a model.
We present the discovery that a student model distilled from a few-shot prompted large language models can commonly generalize better than its teacher.
arXiv Detail & Related papers (2023-11-15T01:28:28Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- No Length Left Behind: Enhancing Knowledge Tracing for Modeling Sequences of Excessive or Insufficient Lengths [3.2687390531088414]
Knowledge tracing aims to predict students' responses to practices based on their historical question-answering behaviors.
As sequences get longer, computational costs will increase exponentially.
We propose a model called Sequence-Flexible Knowledge Tracing (SFKT).
arXiv Detail & Related papers (2023-08-07T11:30:58Z)
- On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes [44.97759066341107]
Generalized Knowledge Distillation (GKD) trains the student on its self-generated output sequences by leveraging feedback from the teacher.
We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks.
arXiv Detail & Related papers (2023-06-23T17:56:26Z)
- Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a certain student.
Joint optimization for both teacher and student is achieved by two iterative steps.
arXiv Detail & Related papers (2021-09-10T03:01:01Z)