Conditional DETR for Fast Training Convergence
- URL: http://arxiv.org/abs/2108.06152v3
- Date: Fri, 29 Sep 2023 13:21:57 GMT
- Title: Conditional DETR for Fast Training Convergence
- Authors: Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui
Yuan, Lei Sun, Jingdong Wang
- Abstract summary: We present a conditional cross-attention mechanism for fast DETR training.
Our approach is motivated by the observation that the cross-attention in DETR relies heavily on the content embeddings for localizing the four extremities.
We show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for stronger backbones DC5-R50 and DC5-R101.
- Score: 76.95358216461524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently developed DETR approach applies the transformer encoder
and decoder architecture to object detection and achieves promising performance.
In this paper, we address the critical issue of slow training convergence and
present a conditional cross-attention mechanism for fast DETR training. Our
approach is motivated by the observation that the cross-attention in DETR relies
heavily on the content embeddings for localizing the four extremities and
predicting the box, which increases the need for high-quality content embeddings
and thus the training difficulty. Our approach, named conditional DETR, learns a conditional
spatial query from the decoder embedding for decoder multi-head
cross-attention. The benefit is that through the conditional spatial query,
each cross-attention head is able to attend to a band containing a distinct
region, e.g., one object extremity or a region inside the object box. This
narrows down the spatial range for localizing the distinct regions for object
classification and box regression, thus relaxing the dependence on the content
embeddings and easing the training. Empirical results show that conditional
DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for
stronger backbones DC5-R50 and DC5-R101. Code is available at
https://github.com/Atten4Vis/ConditionalDETR.
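To make the mechanism concrete, here is a minimal PyTorch sketch of conditional cross-attention as described in the abstract: the spatial query is the sinusoidal embedding of the reference point, scaled elementwise by a vector predicted from the decoder embedding, and queries/keys concatenate content and spatial parts so their dot products stay separate. All names (`ConditionalCrossAttention`, `sine_embed`, `lambda_q`) are our own, not the repository's API.

```python
import math
import torch
import torch.nn as nn

def sine_embed(points, d_model=256):
    """Sinusoidal embedding of normalized 2-D points, following the usual
    DETR positional-encoding recipe (an assumption, not the repo's exact code)."""
    scale = 2 * math.pi
    dim_t = torch.arange(d_model // 2, dtype=torch.float32)
    dim_t = 10000 ** (2 * torch.div(dim_t, 2, rounding_mode='floor') / (d_model // 2))
    x = points[..., 0:1] * scale / dim_t
    y = points[..., 1:2] * scale / dim_t
    x = torch.stack((x[..., 0::2].sin(), x[..., 1::2].cos()), dim=-1).flatten(-2)
    y = torch.stack((y[..., 0::2].sin(), y[..., 1::2].cos()), dim=-1).flatten(-2)
    return torch.cat((y, x), dim=-1)  # (..., d_model)

class ConditionalCrossAttention(nn.Module):
    """Queries and keys are concatenations of a content part and a spatial
    part, so the spatial dot product localizes regions independently of how
    good the content embeddings are."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.q_content = nn.Linear(d_model, d_model)
        self.k_content = nn.Linear(d_model, d_model)
        self.k_pos = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.lambda_q = nn.Linear(d_model, d_model)  # predicts the scaling vector

    def forward(self, dec_embed, memory, memory_pos, ref_points):
        B, N, D = dec_embed.shape
        H = self.n_heads
        # Conditional spatial query: predicted transformation (elementwise
        # scaling) applied to the reference-point embedding.
        p_q = self.lambda_q(dec_embed) * sine_embed(ref_points, D)
        q = torch.cat([self.q_content(dec_embed), p_q], dim=-1)          # (B, N, 2D)
        k = torch.cat([self.k_content(memory), self.k_pos(memory_pos)], dim=-1)
        v = self.v_proj(memory)
        q = q.view(B, N, H, -1).transpose(1, 2)
        k = k.view(B, k.size(1), H, -1).transpose(1, 2)
        v = v.view(B, v.size(1), H, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        out = attn.softmax(dim=-1) @ v
        return out.transpose(1, 2).reshape(B, N, D)

# Smoke test: 4 object queries attending over a 19x19 feature map.
mod = ConditionalCrossAttention()
out = mod(torch.rand(2, 4, 256), torch.rand(2, 361, 256),
          torch.rand(2, 361, 256), torch.rand(2, 4, 2))
print(out.shape)  # torch.Size([2, 4, 256])
```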
Related papers
- Pair DETR: Contrastive Learning Speeds Up DETR Training [0.6491645162078056]
We present a simple approach to address the main problem of DETR: slow convergence.
We detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders (sketched after this entry).
Experiments show that Pair DETR can converge at least 10x faster than original DETR and 1.5x faster than Conditional DETR during training.
arXiv Detail & Related papers (2022-10-29T03:02:49Z)
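Recovering the box from the two predicted keypoints is a simple reflection of the corner through the center. A minimal sketch, where the helper name and tensor layout are our assumptions, not the paper's code:

```python
import torch

def box_from_corner_center(top_left, center):
    """Recover an axis-aligned box from the two keypoints Pair DETR predicts.
    Since the center bisects the box, the bottom-right corner is the
    reflection of the top-left corner through the center."""
    x1, y1 = top_left.unbind(-1)
    cx, cy = center.unbind(-1)
    x2 = 2 * cx - x1  # reflect through the center
    y2 = 2 * cy - y1
    return torch.stack((x1, y1, x2, y2), dim=-1)

# Example: corner (10, 20), center (30, 50) -> box (10, 20, 50, 80).
print(box_from_corner_center(torch.tensor([10., 20.]), torch.tensor([30., 50.])))
```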
- Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion [95.7732308775325]
The proposed DEtection TRansformer (DETR) has established a fully end-to-end paradigm for object detection.
DETR suffers from slow training convergence, which hinders its applicability to various detection tasks.
We design Semantic-Aligned-Matching DETR++ to accelerate DETR's convergence and improve detection performance.
arXiv Detail & Related papers (2022-07-28T15:34:29Z)
- Conditional DETR V2: Efficient Detection Transformer with Box Queries [58.9706842210695]
We are interested in an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing, such as NMS.
Inspired by Conditional DETR, an improved DETR with fast training convergence, we reformulate the object query as a box query, a composition of the embeddings of the reference point (sketched after this entry).
We learn the box queries from the image content, further improving the detection quality of Conditional DETR still with fast training convergence.
arXiv Detail & Related papers (2022-07-18T20:08:55Z)
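A minimal sketch of the box-query idea under one plausible reading of the summary: reference boxes are predicted from salient encoder features ("learned from the image content"), then embedded to serve as decoder queries. The names (`ContentBoxQuery`, `score`, `to_box`), the top-k selection, and the linear stand-in for a sinusoidal box embedding are all our assumptions, not the paper's method in detail.

```python
import torch
import torch.nn as nn

class ContentBoxQuery(nn.Module):
    """Box queries learned from image content: salient encoder features
    propose reference boxes, whose embeddings become decoder queries."""
    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        self.num_queries = num_queries
        self.score = nn.Linear(d_model, 1)   # saliency used to pick locations
        self.to_box = nn.Linear(d_model, 4)  # predicts (cx, cy, w, h) logits
        self.embed = nn.Linear(4, d_model)   # stand-in for sinusoidal box embedding

    def forward(self, memory):
        # memory: (B, HW, d_model) flattened encoder features.
        idx = self.score(memory).squeeze(-1).topk(self.num_queries, dim=1).indices
        feats = memory.gather(1, idx.unsqueeze(-1).expand(-1, -1, memory.size(-1)))
        boxes = self.to_box(feats).sigmoid()  # (B, Q, 4), normalized
        return self.embed(boxes), boxes

q, boxes = ContentBoxQuery()(torch.rand(2, 400, 256))
print(q.shape, boxes.shape)  # torch.Size([2, 100, 256]) torch.Size([2, 100, 4])
```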
- DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [37.61768722607528]
We present a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer).
This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer-by-layer (sketched after this entry).
It leads to the best performance on the MS-COCO benchmark among DETR-like detection models under the same setting.
arXiv Detail & Related papers (2022-01-28T18:51:09Z)
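A minimal sketch of the layer-by-layer anchor update: each decoder layer attends with queries derived from the current (cx, cy, w, h) anchors and then predicts an additive offset in logit space so the boxes stay normalized. A generic `nn.TransformerDecoderLayer` stands in for the paper's modified layer; the module names and the gradient detach between layers are our assumptions.

```python
import torch
import torch.nn as nn

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

class AnchorRefineDecoder(nn.Module):
    """Anchors serve directly as queries and are refined at every layer."""
    def __init__(self, d_model=256, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, 8, batch_first=True)
            for _ in range(num_layers))
        self.anchor_embed = nn.Linear(4, d_model)  # stand-in for sinusoidal embedding
        self.box_head = nn.Linear(d_model, 4)      # per-layer anchor offsets

    def forward(self, tgt, memory, anchors):
        # tgt: (B, Q, d); memory: (B, HW, d); anchors: (B, Q, 4) in [0, 1].
        for layer in self.layers:
            tgt = layer(tgt + self.anchor_embed(anchors), memory)
            # Additive update in logit space keeps the boxes normalized.
            anchors = (inverse_sigmoid(anchors) + self.box_head(tgt)).sigmoid()
            anchors = anchors.detach()  # assumption: stop grads between layers
        return tgt, anchors

dec = AnchorRefineDecoder()
tgt, anchors = dec(torch.rand(2, 100, 256), torch.rand(2, 400, 256),
                   torch.rand(2, 100, 4))
print(anchors.shape)  # torch.Size([2, 100, 4])
```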
- Recurrent Glimpse-based Decoder for Detection with Transformer [85.64521612986456]
We introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper.
In particular, REGO employs a multi-stage recurrent processing structure to help the attention of DETR gradually focus on foreground objects (sketched after this entry).
REGO consistently boosts the performance of different DETR detectors by up to 7% relative gain at the same setting of 50 training epochs.
arXiv Detail & Related papers (2021-12-09T00:29:19Z)
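A minimal sketch of one glimpse stage under one plausible reading of the summary: features are re-extracted around the previous stage's boxes with `torchvision.ops.roi_align` (standing in for the glimpse feature extraction) and the previous decoder output attends to them before the boxes are re-predicted. The stage structure and all names are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class GlimpseStage(nn.Module):
    """One REGO-style stage: re-extract features around the previous boxes
    (the 'glimpse') and let the previous decoder output attend to them."""
    def __init__(self, d_model=256, glimpse=7):
        super().__init__()
        self.glimpse = glimpse
        self.attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.box_head = nn.Linear(d_model, 4)  # refined boxes from the fused output

    def forward(self, feat_map, boxes_xyxy, dec_out):
        # feat_map: (B, C, H, W); boxes_xyxy: list of B tensors (Q, 4) in pixels;
        # dec_out: (B, Q, C) decoder output from the previous stage.
        B, Q, C = dec_out.shape
        rois = roi_align(feat_map, boxes_xyxy, self.glimpse)  # (B*Q, C, g, g)
        rois = rois.flatten(2).transpose(1, 2)                # (B*Q, g*g, C)
        q = dec_out.reshape(B * Q, 1, C)
        out, _ = self.attn(q, rois, rois)  # attend inside the glimpse region
        fused = dec_out + out.reshape(B, Q, C)
        return fused, self.box_head(fused)

stage = GlimpseStage()
boxes = [torch.tensor([[8., 8., 40., 40.], [16., 16., 48., 48.]])] * 2
fused, refined = stage(torch.rand(2, 256, 64, 64), boxes, torch.rand(2, 2, 256))
print(fused.shape, refined.shape)  # torch.Size([2, 2, 256]) torch.Size([2, 2, 4])
```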
- Fast Convergence of DETR with Spatially Modulated Co-Attention [83.19863907905666]
We propose a simple yet effective scheme for improving the Detection Transformer framework, namely the Spatially Modulated Co-Attention (SMCA) mechanism.
Our proposed SMCA increases DETR's convergence speed by replacing the original co-attention mechanism in the decoder with a spatially modulated one (sketched after this entry).
Our fully-fledged SMCA can achieve better performance compared to DETR with a dilated convolution-based backbone.
arXiv Detail & Related papers (2021-08-05T06:53:19Z)
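A minimal sketch of the spatial modulation under the common reading: a Gaussian-like prior centered at each query's estimated object location is added to the cross-attention logits before the softmax, so each query concentrates near its prediction. The function name, the normalized grid, and the single isotropic scale per query are our simplifications.

```python
import torch

def spatially_modulated_attention(logits, centers, scales, h, w):
    """Add a log-Gaussian spatial prior to cross-attention logits so each
    query focuses near its predicted center before the softmax."""
    ys = torch.arange(h, dtype=torch.float32).view(h, 1).expand(h, w) / h
    xs = torch.arange(w, dtype=torch.float32).view(1, w).expand(h, w) / w
    grid = torch.stack((xs, ys), dim=-1).view(1, 1, h * w, 2)  # (1, 1, HW, 2)
    # Squared distance from every spatial location to each query's center.
    d2 = ((grid - centers.unsqueeze(2)) ** 2).sum(-1)          # (B, Q, HW)
    prior = -d2 / (2 * scales.unsqueeze(-1) ** 2)
    return (logits + prior).softmax(dim=-1)

# Two queries over an 8x8 map, centered at opposite corners.
logits = torch.zeros(1, 2, 64)
centers = torch.tensor([[[0.25, 0.25], [0.75, 0.75]]])  # normalized (x, y)
scales = torch.full((1, 2), 0.2)
print(spatially_modulated_attention(logits, centers, scales, 8, 8).shape)
# torch.Size([1, 2, 64])
```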
- UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [11.251593386108189]
We propose a novel pretext task named random query patch detection for Unsupervised Pre-training DETR (UP-DETR).
Specifically, we randomly crop patches from the given image and then feed them as queries to the decoder (sketched after this entry).
UP-DETR significantly boosts the performance of DETR with faster convergence and higher average precision on object detection, one-shot detection and panoptic segmentation.
arXiv Detail & Related papers (2020-11-18T05:16:11Z)
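A minimal sketch of the pretext-task data preparation, assuming a fixed patch size and crop positions shared across the batch for brevity; in the paper the patch features (from the backbone) are added to the decoder queries, which is omitted here. All names are our assumptions.

```python
import torch

def sample_query_patches(images, num_patches=10, patch_size=64):
    """Crop random patches and record where they came from: the patches
    become decoder queries, the recorded boxes the pretext-task targets."""
    B, C, H, W = images.shape
    patches, boxes = [], []
    for _ in range(num_patches):
        x = torch.randint(0, W - patch_size + 1, (1,)).item()
        y = torch.randint(0, H - patch_size + 1, (1,)).item()
        patches.append(images[:, :, y:y + patch_size, x:x + patch_size])
        boxes.append(torch.tensor([(x + patch_size / 2) / W,   # normalized cx
                                   (y + patch_size / 2) / H,   # normalized cy
                                   patch_size / W, patch_size / H]))
    return torch.stack(patches, dim=1), torch.stack(boxes)

patches, boxes = sample_query_patches(torch.rand(2, 3, 256, 256))
print(patches.shape, boxes.shape)
# torch.Size([2, 10, 3, 64, 64]) torch.Size([10, 4])
```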