Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation
- URL: http://arxiv.org/abs/2404.14827v1
- Date: Tue, 23 Apr 2024 08:29:56 GMT
- Title: Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation
- Authors: Jingxuan Wei, Linzhuang Sun, Yichong Leng, Xu Tan, Bihui Yu, Ruifeng Guo
- Abstract summary: Knowledge distillation, transferring knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation.
In this study, we argue that token-level distillation, with its more complex objective (i.e., distribution), is better suited for "simple" scenarios.
We introduce a novel hybrid method that combines token-level and sentence-level distillation through a gating mechanism.
- Score: 25.58020699235669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation, transferring knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation for compressing models or simplifying training targets. Knowledge distillation encompasses two primary methods: sentence-level distillation and token-level distillation. In sentence-level distillation, the student model is trained to align with the output of the teacher model, which can alleviate the training difficulty and give the student model a comprehensive understanding of global structure. In contrast, token-level distillation requires the student model to learn the output distribution of the teacher model, facilitating a more fine-grained transfer of knowledge. Studies have revealed divergent performance between sentence-level and token-level distillation across different scenarios, leading to confusion over the empirical selection of knowledge distillation methods. In this study, we argue that token-level distillation, with its more complex objective (i.e., distribution), is better suited for "simple" scenarios, while sentence-level distillation excels in "complex" scenarios. To substantiate our hypothesis, we systematically analyze the performance of distillation methods by varying the model size of student models, the complexity of the text, and the difficulty of the decoding procedure. While our experimental results validate our hypothesis, defining the complexity level of a given scenario remains a challenging task. We therefore introduce a novel hybrid method that combines token-level and sentence-level distillation through a gating mechanism, aiming to leverage the advantages of both individual methods. Experiments demonstrate that the hybrid method surpasses both individual distillation methods and previous works by a margin, confirming the effectiveness of the proposed hybrid method.
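The two objectives contrasted in the abstract, and the gated combination of them, can be sketched numerically. The following is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the scalar `gate` stands in for whatever learned gating mechanism the authors use, and the sentence-level objective is approximated as cross-entropy against the teacher's decoded token sequence.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_level_loss(student_logits, teacher_logits):
    """Mean KL(teacher || student) over token positions: the student
    matches the teacher's full output distribution at every step."""
    p = softmax(teacher_logits)
    log_q = np.log(softmax(student_logits))
    kl = (p * (np.log(p) - log_q)).sum(axis=-1)
    return kl.mean()

def sentence_level_loss(student_logits, teacher_tokens):
    """Cross-entropy against the teacher's decoded output sequence,
    treated as hard targets (a simple proxy for sentence-level KD)."""
    log_q = np.log(softmax(student_logits))
    steps = np.arange(len(teacher_tokens))
    return -log_q[steps, teacher_tokens].mean()

def hybrid_loss(student_logits, teacher_logits, teacher_tokens, gate):
    """Gated mixture of the two objectives; gate in [0, 1] plays the
    role of the paper's gating mechanism (here just a fixed scalar)."""
    return (gate * token_level_loss(student_logits, teacher_logits)
            + (1.0 - gate) * sentence_level_loss(student_logits, teacher_tokens))
```

With `gate = 1` the hybrid reduces to pure token-level distillation, and with `gate = 0` to pure sentence-level distillation; intermediate values interpolate between the fine-grained and global objectives.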
Related papers
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution-adaptive clipping Kullback-Leibler (KL) loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
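As a rough illustration of the clipped KL objective mentioned above: the fixed thresholds and the clip-then-renormalize rule below are assumptions for illustration only; the paper's "distribution adaptive" scheme is more involved.

```python
import numpy as np

def clipped_kl_loss(teacher_probs, student_probs, low=0.02, high=0.98):
    """Mean KL(teacher || student) after clipping the teacher
    distribution into [low, high] and renormalizing. The fixed
    thresholds are an illustrative stand-in for the adaptive
    clipping described in the paper."""
    p = np.clip(teacher_probs, low, high)
    p = p / p.sum(axis=-1, keepdims=True)  # renormalize after clipping
    return (p * (np.log(p) - np.log(student_probs))).sum(axis=-1).mean()
```

Clipping caps the influence of the teacher's near-zero and near-one probabilities, which otherwise dominate the KL gradient when student and teacher disagree on rare tokens.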
arXiv Detail & Related papers (2024-07-14T03:51:49Z) - Education distillation:getting student models to learn in shcools [15.473668050280304]
This paper introduces dynamic incremental learning into knowledge distillation.
Fragmented student models, divided from the complete student model, are taken as lower-grade models.
arXiv Detail & Related papers (2023-11-23T05:20:18Z) - Can a student Large Language Model perform as well as it's teacher? [0.0]
Knowledge distillation aims to transfer knowledge from a high-capacity "teacher" model to a streamlined "student" model.
This paper provides a comprehensive overview of the knowledge distillation paradigm.
arXiv Detail & Related papers (2023-10-03T20:34:59Z) - The Staged Knowledge Distillation in Video Classification: Harmonizing Student Progress by a Complementary Weakly Supervised Framework [21.494759678807686]
We propose a new weakly supervised learning framework for knowledge distillation in video classification.
Our approach leverages the concept of substage-based learning to distill knowledge based on the combination of student substages and the correlation of corresponding substages.
Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.
arXiv Detail & Related papers (2023-07-11T12:10:42Z) - HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z) - Life-long Learning for Multilingual Neural Machine Translation with Knowledge Distillation [48.96946395851039]
A common scenario of Multilingual Neural Machine Translation (MNMT) is that each translation task arrives in a sequential manner, and the training data of previous tasks is unavailable.
We propose a multilingual distillation method that makes the new model jointly learn multilingual output from the old model (teacher) and the new task.
The experimental results on twelve translation tasks show that the proposed methods can better consolidate the previous knowledge and sharply alleviate catastrophic forgetting (CF).
arXiv Detail & Related papers (2022-12-06T07:36:16Z) - Referee: Reference-Free Sentence Summarization with Sharper Controllability through Symbolic Knowledge Distillation [72.70058049274664]
We present Referee, a novel framework for sentence summarization that can be trained reference-free (i.e., requiring no gold summaries for supervision).
Our work is the first to demonstrate that reference-free, controlled sentence summarization is feasible via the conceptual framework of Symbolic Knowledge Distillation.
arXiv Detail & Related papers (2022-10-25T07:07:54Z) - Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine-grained face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z) - Why distillation helps: a statistical perspective [69.90148901064747]
Knowledge distillation is a technique for improving the performance of a simple "student" model.
While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help?
We show how distillation complements existing negative mining techniques for extreme multiclass retrieval.
arXiv Detail & Related papers (2020-05-21T01:49:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.