D$^3$ETR: Decoder Distillation for Detection Transformer
- URL: http://arxiv.org/abs/2211.09768v1
- Date: Thu, 17 Nov 2022 18:47:24 GMT
- Title: D$^3$ETR: Decoder Distillation for Detection Transformer
- Authors: Xiaokang Chen, Jiahui Chen, Yan Liu, Gang Zeng
- Abstract summary: We focus on the transformer decoder of DETR-based detectors and explore KD methods for them.
The outputs of the transformer decoder lie in random order, which gives no direct correspondence between the predictions of the teacher and the student.
We build \textbf{D}ecoder \textbf{D}istillation for \textbf{DE}tection \textbf{TR}ansformer (D$^3$ETR), which distills knowledge in decoder predictions and attention maps from the teachers to students.
- Score: 20.493873634246512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While various knowledge distillation (KD) methods in CNN-based detectors show
their effectiveness in improving small students, the baselines and recipes for
DETR-based detectors are yet to be built. In this paper, we focus on the
transformer decoder of DETR-based detectors and explore KD methods for them.
The outputs of the transformer decoder lie in random order, which gives no
direct correspondence between the predictions of the teacher and the student,
thus posing a challenge for knowledge distillation. To this end, we propose
MixMatcher to align the decoder outputs of DETR-based teachers and students,
which mixes two teacher-student matching strategies, i.e., Adaptive Matching
and Fixed Matching. Specifically, Adaptive Matching applies bipartite matching
to adaptively match the outputs of the teacher and the student in each decoder
layer, while Fixed Matching fixes the correspondence between the outputs of the
teacher and the student with the same object queries, with the teacher's fixed
object queries fed to the decoder of the student as an auxiliary group.
Based on MixMatcher, we build \textbf{D}ecoder \textbf{D}istillation for
\textbf{DE}tection \textbf{TR}ansformer (D$^3$ETR), which distills knowledge in
decoder predictions and attention maps from the teachers to students. D$^3$ETR
shows superior performance on various DETR-based detectors with different
backbones. For example, D$^3$ETR improves Conditional DETR-R50-C5 by
$\textbf{7.8}/\textbf{2.4}$ mAP under $12/50$ epochs training settings with
Conditional DETR-R101-C5 as the teacher.
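The abstract describes MixMatcher only in prose; below is a minimal sketch of the two matching strategies and the resulting distillation losses, assuming hypothetical decoder-output tensors (`t_logits`, `t_boxes`, `s_logits`, `s_boxes`) and a simple classification-plus-L1 box matching cost. The actual D$^3$ETR cost terms, loss weights, and attention-map handling may differ.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def adaptive_matching(t_logits, t_boxes, s_logits, s_boxes):
    """Adaptive Matching: Hungarian (bipartite) matching between teacher and student predictions.

    t_logits/s_logits: [num_queries, num_classes]; t_boxes/s_boxes: [num_queries, 4].
    The cost below (class agreement + L1 box distance) is an assumption for illustration.
    """
    cls_cost = -(s_logits.softmax(-1) @ t_logits.softmax(-1).t())  # [Ns, Nt]; high agreement -> low cost
    box_cost = torch.cdist(s_boxes, t_boxes, p=1)                  # [Ns, Nt]; L1 distance between boxes
    cost = (cls_cost + box_cost).detach().cpu().numpy()
    s_idx, t_idx = linear_sum_assignment(cost)                     # one-to-one assignment
    return torch.as_tensor(s_idx), torch.as_tensor(t_idx)


def prediction_distill_loss(t_logits, t_boxes, s_logits, s_boxes, t_attn=None, s_attn=None):
    """Distill predictions (and optionally attention maps) over the matched query pairs."""
    s_idx, t_idx = adaptive_matching(t_logits, t_boxes, s_logits, s_boxes)
    loss = F.kl_div(s_logits[s_idx].log_softmax(-1),
                    t_logits[t_idx].softmax(-1), reduction="batchmean")
    loss = loss + F.l1_loss(s_boxes[s_idx], t_boxes[t_idx])
    if t_attn is not None and s_attn is not None:
        loss = loss + F.mse_loss(s_attn[s_idx], t_attn[t_idx])     # attention-map distillation
    return loss


# Fixed Matching (sketch): feed the teacher's object queries to the student decoder as an
# auxiliary query group, so the i-th auxiliary student output already corresponds to the
# i-th teacher output and no bipartite matching is needed, e.g. (hypothetical names):
#   aux_logits, aux_boxes = student_decoder(memory, teacher_object_queries)
#   fixed_loss = prediction-level KD between (aux_logits, aux_boxes) and (t_logits, t_boxes)
```

In this reading, MixMatcher applies both losses at each decoder layer; the auxiliary query group is used only for distillation, so it adds no parameters or cost at inference time.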
Related papers
- How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval? [99.87554379608224]
The cross-modal similarity score distribution of the cross-encoder is more concentrated, while that of the dual-encoder is nearly normal.
Only the relative order between hard negatives conveys valid knowledge, while the order information between easy negatives has little significance.
We propose a novel Contrastive Partial Ranking Distillation (DCPR) method which mimics the relative order between hard negative samples with contrastive learning.
arXiv Detail & Related papers (2024-07-10T09:10:01Z)
- OD-DETR: Online Distillation for Stabilizing Training of Detection Transformer [14.714768026997534]
This paper aims to stabilize DETR training through the online distillation.
It utilizes a teacher model accumulated by Exponential Moving Average (EMA).
Experiments show that the proposed OD-DETR successfully stabilizes the training, and significantly increases the performance without bringing in more parameters.
arXiv Detail & Related papers (2024-06-09T14:07:35Z)
- Semi-DETR: Semi-Supervised Object Detection with Detection Transformers [105.45018934087076]
We analyze the DETR-based framework on semi-supervised object detection (SSOD).
We present Semi-DETR, the first transformer-based end-to-end semi-supervised object detector.
Our method outperforms all state-of-the-art methods by clear margins.
arXiv Detail & Related papers (2023-07-16T16:32:14Z)
- Detection Transformer with Stable Matching [48.963171068785435]
We show that the most important design is to use, and only use, positional metrics to supervise the classification scores of positive examples.
Under this principle, we propose two simple yet effective modifications by integrating positional metrics into DETR's classification loss and matching cost.
We achieve 50.4 and 51.5 AP on the COCO detection benchmark using ResNet-50 backbones under 12 epochs and 24 epochs training settings.
arXiv Detail & Related papers (2023-04-10T17:55:37Z)
- Noise-Robust Dense Retrieval via Contrastive Alignment Post Training [89.29256833403167]
Contrastive Alignment POst Training (CAPOT) is a highly efficient finetuning method that improves model robustness without requiring index regeneration.
CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered root.
We evaluate CAPOT on noisy variants of MSMARCO, Natural Questions, and Trivia QA passage retrieval, finding that CAPOT has a similar impact as data augmentation with none of its overhead.
arXiv Detail & Related papers (2023-04-06T22:16:53Z)
- Exploring Content Relationships for Distilling Efficient GANs [69.86835014810714]
This paper proposes content relationship distillation (CRD) to tackle over-parameterized generative adversarial networks (GANs).
In contrast to traditional instance-level distillation, we design a novel GAN-compression-oriented knowledge by slicing the contents of teacher outputs into multiple fine-grained granularities.
Built upon our proposed content-level distillation, we also deploy an online teacher discriminator, which keeps updating when co-trained with the teacher generator and stays frozen when co-trained with the student generator for better adversarial training.
arXiv Detail & Related papers (2022-12-21T15:38:12Z)
- DETRs with Collaborative Hybrid Assignments Training [11.563949886871713]
We present a novel collaborative hybrid assignments training scheme, namely $\mathcal{C}$o-DETR.
This training scheme can easily enhance the encoder's learning ability in end-to-end detectors.
We conduct extensive experiments to evaluate the effectiveness of the proposed approach on DETR variants.
arXiv Detail & Related papers (2022-11-22T16:19:52Z)
- Pair DETR: Contrastive Learning Speeds Up DETR Training [0.6491645162078056]
We present a simple approach to address the main problem of DETR, the slow convergence.
We detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders.
Experiments show that Pair DETR can converge at least 10x faster than original DETR and 1.5x faster than Conditional DETR during training.
arXiv Detail & Related papers (2022-10-29T03:02:49Z)
- G-DetKD: Towards General Distillation Framework for Object Detectors via Contrastive and Semantic-guided Feature Imitation [49.421099172544196]
We propose a novel semantic-guided feature imitation technique, which automatically performs soft matching between feature pairs across all pyramid levels.
We also introduce contrastive distillation to effectively capture the information encoded in the relationship between different feature regions.
Our method consistently outperforms existing detection KD techniques, and works both when components in the framework are used separately and when they are used in conjunction.
arXiv Detail & Related papers (2021-08-17T07:44:27Z)
- CoDERT: Distilling Encoder Representations with Co-learning for Transducer-based Speech Recognition [14.07385381963374]
We show that the transducer's encoder outputs naturally have a high entropy and contain rich information about acoustically similar word-piece confusions.
We introduce an auxiliary loss to distill the encoder logits from a teacher transducer's encoder, and explore training strategies where this encoder distillation works effectively.
arXiv Detail & Related papers (2021-06-14T20:03:57Z)