Progressive Class-level Distillation
- URL: http://arxiv.org/abs/2505.24310v1
- Date: Fri, 30 May 2025 07:49:01 GMT
- Title: Progressive Class-level Distillation
- Authors: Jiayan Li, Jun Li, Zhourui Zhang, Jianhua Xu,
- Abstract summary: We propose a Progressive Class-level Distillation (PCD) method for logit distillation.<n>Our PCD approach performs stage-wise distillation for step-by-step knowledge transfer.<n>Our method is superior to state-of-the-arts for both classification and detection tasks.
- Score: 4.33169417430713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In knowledge distillation (KD), logit distillation (LD) aims to transfer class-level knowledge from a more powerful teacher network to a small student model via accurate teacher-student alignment at the logits level. Since high-confidence object classes usually dominate the distillation process, low-probability classes which also contain discriminating information are downplayed in conventional methods, leading to insufficient knowledge transfer. To address this issue, we propose a simple yet effective LD method termed Progressive Class-level Distillation (PCD). In contrast to existing methods which perform all-class ensemble distillation, our PCD approach performs stage-wise distillation for step-by-step knowledge transfer. More specifically, we perform ranking on teacher-student logits difference for identifying distillation priority from scratch, and subsequently divide the entire LD process into multiple stages. Next, bidirectional stage-wise distillation incorporating fine-to-coarse progressive learning and reverse coarse-to-fine refinement is conducted, allowing comprehensive knowledge transfer via sufficient logits alignment within separate class groups in different distillation stages. Extension experiments on public benchmarking datasets demonstrate the superiority of our method compared to state-of-the-arts for both classification and detection tasks.
Related papers
- Efficient Knowledge Injection in LLMs via Self-Distillation [50.24554628642021]
This paper proposes utilizing prompt distillation to internalize new factual knowledge from free-form documents.<n>We show that prompt distillation outperforms standard supervised fine-tuning and can even surpass RAG.
arXiv Detail & Related papers (2024-12-19T15:44:01Z) - Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.<n>Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.<n>Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z) - Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z) - The Staged Knowledge Distillation in Video Classification: Harmonizing
Student Progress by a Complementary Weakly Supervised Framework [21.494759678807686]
We propose a new weakly supervised learning framework for knowledge distillation in video classification.
Our approach leverages the concept of substage-based learning to distill knowledge based on the combination of student substages and the correlation of corresponding substages.
Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.
arXiv Detail & Related papers (2023-07-11T12:10:42Z) - Class-aware Information for Logit-based Knowledge Distillation [16.634819319915923]
We propose a Class-aware Logit Knowledge Distillation (CLKD) method, that extents the logit distillation in both instance-level and class-level.
CLKD enables the student model mimic higher semantic information from the teacher model, hence improving the distillation performance.
arXiv Detail & Related papers (2022-11-27T09:27:50Z) - DETRDistill: A Universal Knowledge Distillation Framework for
DETR-families [11.9748352746424]
Transformer-based detectors (DETRs) have attracted great attention due to their sparse training paradigm and the removal of post-processing operations.
Knowledge distillation (KD) can be employed to compress the huge model by constructing a universal teacher-student learning framework.
arXiv Detail & Related papers (2022-11-17T13:35:11Z) - ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self
On-the-fly Distillation for Dense Passage Retrieval [54.54667085792404]
We propose a novel distillation method that significantly advances cross-architecture distillation for dual-encoders.
Our method 1) introduces a self on-the-fly distillation method that can effectively distill late interaction (i.e., ColBERT) to vanilla dual-encoder, and 2) incorporates a cascade distillation process to further improve the performance with a cross-encoder teacher.
arXiv Detail & Related papers (2022-05-18T18:05:13Z) - Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel em modelname (bfem shortname) method dedicated for distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation with open-set semi-supervised learning (SSL)
Our shortname outperforms significantly previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z) - Localization Distillation for Object Detection [134.12664548771534]
Previous knowledge distillation (KD) methods for object detection mostly focus on feature imitation instead of mimicking the classification logits.
We present a novel localization distillation (LD) method which can efficiently transfer the localization knowledge from the teacher to the student.
We show that logit mimicking can outperform feature imitation and the absence of localization distillation is a critical reason for why logit mimicking underperforms for years.
arXiv Detail & Related papers (2022-04-12T17:14:34Z) - LRC-BERT: Latent-representation Contrastive Knowledge Distillation for
Natural Language Understanding [12.208166079145538]
We propose a knowledge distillation method LRC-BERT based on contrastive learning to fit the output of the intermediate layer from the angular distance aspect.
By verifying 8 datasets on the General Language Understanding Evaluation (GLUE) benchmark, the performance of the proposed LRC-BERT exceeds the existing state-of-the-art methods.
arXiv Detail & Related papers (2020-12-14T08:39:38Z) - Contrastive Distillation on Intermediate Representations for Language
Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of rich information in teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
arXiv Detail & Related papers (2020-09-29T17:31:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.