DETRDistill: A Universal Knowledge Distillation Framework for
DETR-families
- URL: http://arxiv.org/abs/2211.10156v2
- Date: Mon, 21 Nov 2022 07:40:11 GMT
- Title: DETRDistill: A Universal Knowledge Distillation Framework for
DETR-families
- Authors: Jiahao Chang, Shuo Wang, Guangkai Xu, Zehui Chen, Chenhongyi Yang,
Feng Zhao
- Abstract summary: Transformer-based detectors (DETRs) have attracted great attention due to their sparse training paradigm and the removal of post-processing operations.
Knowledge distillation (KD) can be employed to compress the huge model by constructing a universal teacher-student learning framework.
- Score: 11.9748352746424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based detectors (DETRs) have attracted great attention due to
their sparse training paradigm and the removal of post-processing operations,
but such large models are computationally expensive and difficult to deploy
in real-world applications. To tackle this problem, knowledge
distillation (KD) can be employed to compress the huge model by constructing a
universal teacher-student learning framework. Different from the traditional
CNN detectors, where the distillation targets can be naturally aligned through
the feature map, DETR regards object detection as a set prediction problem,
leading to an unclear relationship between teacher and student during
distillation. In this paper, we propose DETRDistill, a novel knowledge
distillation framework dedicated to the DETR family. We first explore a sparse matching
paradigm with progressive stage-by-stage instance distillation. Considering the
diverse attention mechanisms adopted in different DETRs, we propose an
attention-agnostic feature distillation module to overcome the ineffectiveness
of conventional feature imitation. Finally, to fully leverage the intermediate
products from the teacher, we introduce teacher-assisted assignment
distillation, which uses the teacher's object queries and assignment results
to provide an additional group of guidance for the student. Extensive
experiments demonstrate that our distillation method achieves significant
improvements on various competitive DETR approaches without introducing any
extra cost in the inference phase.
To the best of our knowledge, this is the first systematic study to explore a
general distillation method for DETR-style detectors.
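To make the matching-based instance distillation described above concrete, the snippet below is a minimal, illustrative PyTorch sketch (an assumption-based example, not the authors' released implementation): student query predictions are matched one-to-one to the teacher's predictions with the Hungarian algorithm, and distillation losses are applied only to the matched pairs. The cost weights, loss choices, and tensor shapes are illustrative assumptions.

```python
# Minimal, illustrative sketch of instance-level distillation for DETR-style
# detectors (an assumption-based example, not the authors' released code).
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def instance_distillation_loss(stu_logits, stu_boxes, tea_logits, tea_boxes,
                               cls_weight=1.0, box_weight=5.0):
    """Distill a single image's query predictions from teacher to student.

    stu_logits, tea_logits: (num_queries, num_classes)
    stu_boxes,  tea_boxes:  (num_queries, 4) normalized (cx, cy, w, h) boxes
    """
    stu_prob = stu_logits.softmax(-1)
    tea_prob = tea_logits.softmax(-1)

    # Pairwise matching cost between student and teacher queries:
    # negative class agreement plus an L1 box cost.
    cls_cost = -(stu_prob @ tea_prob.t())              # (Nq_s, Nq_t)
    box_cost = torch.cdist(stu_boxes, tea_boxes, p=1)  # (Nq_s, Nq_t)
    cost = cls_weight * cls_cost + box_weight * box_cost

    # One-to-one (Hungarian) assignment between student and teacher queries.
    stu_idx, tea_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    stu_idx = torch.as_tensor(stu_idx, dtype=torch.long)
    tea_idx = torch.as_tensor(tea_idx, dtype=torch.long)

    # Distill only the matched pairs: KL on class distributions, L1 on boxes.
    loss_cls = F.kl_div(stu_logits[stu_idx].log_softmax(-1),
                        tea_prob[tea_idx], reduction="batchmean")
    loss_box = F.l1_loss(stu_boxes[stu_idx], tea_boxes[tea_idx])
    return cls_weight * loss_cls + box_weight * loss_box


if __name__ == "__main__":
    # 300 queries, 80 classes: shapes chosen purely for illustration.
    s_logits, s_boxes = torch.randn(300, 80), torch.rand(300, 4)
    t_logits, t_boxes = torch.randn(300, 80), torch.rand(300, 4)
    print(instance_distillation_loss(s_logits, s_boxes, t_logits, t_boxes))
```

In a DETR-style detector, such a loss would typically be applied at every decoder stage, which is the spirit of the progressive stage-by-stage instance distillation mentioned in the abstract.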
Related papers
- Knowledge Distillation via Query Selection for Detection Transformer [25.512519971607237]
This paper addresses the challenge of compressing DETR by leveraging knowledge distillation.
A critical aspect of DETRs' performance is their reliance on queries to interpret object representations accurately.
Our visual analysis indicates that hard-negative queries, focusing on foreground elements, are crucial for enhancing distillation outcomes.
arXiv Detail & Related papers (2024-09-10T11:49:28Z) - Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.
Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.
Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z) - Dual Knowledge Distillation for Efficient Sound Event Detection [20.236008919003083]
Sound event detection (SED) is essential for recognizing specific sounds and their temporal locations within acoustic signals.
We introduce a novel framework referred to as dual knowledge distillation for developing efficient SED systems.
arXiv Detail & Related papers (2024-02-05T07:30:32Z) - Supervision Complexity and its Role in Knowledge Distillation [65.07910515406209]
We study the generalization behavior of a distilled student.
The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions.
We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
arXiv Detail & Related papers (2023-01-28T16:34:47Z) - Class-aware Information for Logit-based Knowledge Distillation [16.634819319915923]
We propose a Class-aware Logit Knowledge Distillation (CLKD) method that extends logit distillation to both the instance level and the class level.
CLKD enables the student model to mimic higher-level semantic information from the teacher model, hence improving the distillation performance.
arXiv Detail & Related papers (2022-11-27T09:27:50Z) - Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue: popular items are more heavily recommended after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - Knowledge Distillation for Detection Transformer with Consistent
Distillation Points Sampling [38.60121990752897]
We propose a knowledge distillation paradigm for DETR (KD-DETR) with consistent distillation points sampling.
KD-DETR boosts the performance of DAB-DETR with ResNet-18 and ResNet-50 backbones to 41.4% and 45.7% mAP, respectively, and the ResNet-50 student even surpasses the teacher model by 2.2%.
arXiv Detail & Related papers (2022-11-15T11:52:30Z) - ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self
On-the-fly Distillation for Dense Passage Retrieval [54.54667085792404]
We propose a novel distillation method that significantly advances cross-architecture distillation for dual-encoders.
Our method 1) introduces a self on-the-fly distillation method that can effectively distill late interaction (i.e., ColBERT) to a vanilla dual-encoder, and 2) incorporates a cascade distillation process to further improve performance with a cross-encoder teacher.
arXiv Detail & Related papers (2022-05-18T18:05:13Z) - Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine-grained face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z) - On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z) - Contrastive Distillation on Intermediate Representations for Language
Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers (a minimal sketch of this contrastive objective appears after this list).
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
arXiv Detail & Related papers (2020-09-29T17:31:43Z)
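As referenced in the CoDIR entry above, the following is a minimal, hedged sketch of contrastive distillation on intermediate representations: each student representation treats the teacher representation of the same sample as its positive and the other in-batch teacher representations as negatives. The projection dimensions, temperature, and pooling are illustrative assumptions, not CoDIR's actual configuration.

```python
# Illustrative sketch of contrastive distillation on intermediate
# representations (in the spirit of the CoDIR entry above, not its exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveDistillLoss(nn.Module):
    def __init__(self, stu_dim=384, tea_dim=768, proj_dim=128, temperature=0.07):
        super().__init__()
        # Project both models' hidden states into a shared embedding space.
        self.stu_proj = nn.Linear(stu_dim, proj_dim)
        self.tea_proj = nn.Linear(tea_dim, proj_dim)
        self.temperature = temperature

    def forward(self, stu_hidden, tea_hidden):
        # stu_hidden: (batch, stu_dim), tea_hidden: (batch, tea_dim),
        # e.g. mean-pooled intermediate-layer states for each sample.
        z_s = F.normalize(self.stu_proj(stu_hidden), dim=-1)
        z_t = F.normalize(self.tea_proj(tea_hidden), dim=-1)
        logits = z_s @ z_t.t() / self.temperature          # (batch, batch)
        # Diagonal entries are the positive (same-sample) pairs.
        targets = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    loss_fn = ContrastiveDistillLoss()
    loss = loss_fn(torch.randn(16, 384), torch.randn(16, 768))
    print(loss.item())
```

This InfoNCE-style objective is one common way to realize the "distinguish a positive from many negatives" idea; the in-batch negatives used here are a simplification.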