Conditional DETR for Fast Training Convergence
- URL: http://arxiv.org/abs/2108.06152v3
- Date: Fri, 29 Sep 2023 13:21:57 GMT
- Title: Conditional DETR for Fast Training Convergence
- Authors: Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui
Yuan, Lei Sun, Jingdong Wang
- Abstract summary: We present a conditional cross-attention mechanism for fast DETR training.
Our approach is motivated by the observation that the cross-attention in DETR relies heavily on the content embeddings for localizing the four extremities.
We show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for stronger backbones DC5-R50 and DC5-R101.
- Score: 76.95358216461524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently developed DETR approach applies the transformer encoder
and decoder architecture to object detection and achieves promising performance.
In this paper, we address the critical issue of slow training convergence and
present a conditional cross-attention mechanism for fast DETR training. Our
approach is motivated by the observation that the cross-attention in DETR relies
heavily on the content embeddings for localizing the four extremities and
predicting the box, which increases the need for high-quality content embeddings
and thus the training difficulty. Our approach, named conditional DETR, learns a conditional
spatial query from the decoder embedding for decoder multi-head
cross-attention. The benefit is that through the conditional spatial query,
each cross-attention head is able to attend to a band containing a distinct
region, e.g., one object extremity or a region inside the object box. This
narrows down the spatial range for localizing the distinct regions for object
classification and box regression, thus relaxing the dependence on the content
embeddings and easing the training. Empirical results show that conditional
DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for
stronger backbones DC5-R50 and DC5-R101. Code is available at
https://github.com/Atten4Vis/ConditionalDETR.
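To make the mechanism concrete, here is a minimal PyTorch sketch of conditional cross-attention as described in the abstract: the spatial query is the sinusoidal embedding of the reference point, scaled elementwise by a vector predicted from the decoder embedding, and queries/keys concatenate content and spatial parts so their dot products stay separate. All names (`ConditionalCrossAttention`, `sine_embed`, `lambda_q`) are our own, not the repository's API.

```python
import math
import torch
import torch.nn as nn

def sine_embed(points, d_model=256):
    """Sinusoidal embedding of normalized 2-D points, following the usual
    DETR positional-encoding recipe (an assumption, not the repo's exact code)."""
    scale = 2 * math.pi
    dim_t = torch.arange(d_model // 2, dtype=torch.float32)
    dim_t = 10000 ** (2 * torch.div(dim_t, 2, rounding_mode='floor') / (d_model // 2))
    x = points[..., 0:1] * scale / dim_t
    y = points[..., 1:2] * scale / dim_t
    x = torch.stack((x[..., 0::2].sin(), x[..., 1::2].cos()), dim=-1).flatten(-2)
    y = torch.stack((y[..., 0::2].sin(), y[..., 1::2].cos()), dim=-1).flatten(-2)
    return torch.cat((y, x), dim=-1)  # (..., d_model)

class ConditionalCrossAttention(nn.Module):
    """Queries and keys are concatenations of a content part and a spatial
    part, so the spatial dot product localizes regions independently of how
    good the content embeddings are."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.q_content = nn.Linear(d_model, d_model)
        self.k_content = nn.Linear(d_model, d_model)
        self.k_pos = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.lambda_q = nn.Linear(d_model, d_model)  # predicts the scaling vector

    def forward(self, dec_embed, memory, memory_pos, ref_points):
        B, N, D = dec_embed.shape
        H = self.n_heads
        # Conditional spatial query: predicted transformation (elementwise
        # scaling) applied to the reference-point embedding.
        p_q = self.lambda_q(dec_embed) * sine_embed(ref_points, D)
        q = torch.cat([self.q_content(dec_embed), p_q], dim=-1)          # (B, N, 2D)
        k = torch.cat([self.k_content(memory), self.k_pos(memory_pos)], dim=-1)
        v = self.v_proj(memory)
        q = q.view(B, N, H, -1).transpose(1, 2)
        k = k.view(B, k.size(1), H, -1).transpose(1, 2)
        v = v.view(B, v.size(1), H, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        out = attn.softmax(dim=-1) @ v
        return out.transpose(1, 2).reshape(B, N, D)

# Smoke test: 4 object queries attending over a 19x19 feature map.
mod = ConditionalCrossAttention()
out = mod(torch.rand(2, 4, 256), torch.rand(2, 361, 256),
          torch.rand(2, 361, 256), torch.rand(2, 4, 2))
print(out.shape)  # torch.Size([2, 4, 256])
```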
Related papers
- Pair DETR: Contrastive Learning Speeds Up DETR Training [0.6491645162078056]
We present a simple approach to address the main problem of DETR: slow convergence.
We detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders (sketched after this entry).
Experiments show that Pair DETR can converge at least 10x faster than original DETR and 1.5x faster than Conditional DETR during training.
arXiv Detail & Related papers (2022-10-29T03:02:49Z)
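Recovering the box from the two predicted keypoints is a simple reflection of the corner through the center. A minimal sketch, where the helper name and tensor layout are our assumptions, not the paper's code:

```python
import torch

def box_from_corner_center(top_left, center):
    """Recover an axis-aligned box from the two keypoints Pair DETR predicts.
    Since the center bisects the box, the bottom-right corner is the
    reflection of the top-left corner through the center."""
    x1, y1 = top_left.unbind(-1)
    cx, cy = center.unbind(-1)
    x2 = 2 * cx - x1  # reflect through the center
    y2 = 2 * cy - y1
    return torch.stack((x1, y1, x2, y2), dim=-1)

# Example: corner (10, 20), center (30, 50) -> box (10, 20, 50, 80).
print(box_from_corner_center(torch.tensor([10., 20.]), torch.tensor([30., 50.])))
```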
- Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion [95.7732308775325]
The proposed DEtection TRansformer (DETR) has established a fully end-to-end paradigm for object detection.
DETR suffers from slow training convergence, which hinders its applicability to various detection tasks.
We design Semantic-Aligned-Matching DETR++ to accelerate DETR's convergence and improve detection performance.
arXiv Detail & Related papers (2022-07-28T15:34:29Z)
- Conditional DETR V2: Efficient Detection Transformer with Box Queries [58.9706842210695]
We are interested in an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing, such as NMS.
Inspired by Conditional DETR, an improved DETR with fast training convergence, we reformulate the object query as a box query, a composition of the embeddings of the reference point (sketched after this entry).
We learn the box queries from the image content, further improving the detection quality of Conditional DETR still with fast training convergence.
arXiv Detail & Related papers (2022-07-18T20:08:55Z)
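A minimal sketch of the box-query idea under one plausible reading of the summary: reference boxes are predicted from salient encoder features ("learned from the image content"), then embedded to serve as decoder queries. The names (`ContentBoxQuery`, `score`, `to_box`), the top-k selection, and the linear stand-in for a sinusoidal box embedding are all our assumptions, not the paper's method in detail.

```python
import torch
import torch.nn as nn

class ContentBoxQuery(nn.Module):
    """Box queries learned from image content: salient encoder features
    propose reference boxes, whose embeddings become decoder queries."""
    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        self.num_queries = num_queries
        self.score = nn.Linear(d_model, 1)   # saliency used to pick locations
        self.to_box = nn.Linear(d_model, 4)  # predicts (cx, cy, w, h) logits
        self.embed = nn.Linear(4, d_model)   # stand-in for sinusoidal box embedding

    def forward(self, memory):
        # memory: (B, HW, d_model) flattened encoder features.
        idx = self.score(memory).squeeze(-1).topk(self.num_queries, dim=1).indices
        feats = memory.gather(1, idx.unsqueeze(-1).expand(-1, -1, memory.size(-1)))
        boxes = self.to_box(feats).sigmoid()  # (B, Q, 4), normalized
        return self.embed(boxes), boxes

q, boxes = ContentBoxQuery()(torch.rand(2, 400, 256))
print(q.shape, boxes.shape)  # torch.Size([2, 100, 256]) torch.Size([2, 100, 4])
```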
- DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [37.61768722607528]
We present a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer).
This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer-by-layer (sketched after this entry).
It leads to the best performance on the MS-COCO benchmark among DETR-like detection models under the same setting.
arXiv Detail & Related papers (2022-01-28T18:51:09Z)
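A minimal sketch of the layer-by-layer anchor update: each decoder layer attends with queries derived from the current (cx, cy, w, h) anchors and then predicts an additive offset in logit space so the boxes stay normalized. A generic `nn.TransformerDecoderLayer` stands in for the paper's modified layer; the module names and the gradient detach between layers are our assumptions.

```python
import torch
import torch.nn as nn

def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(eps, 1 - eps)
    return torch.log(x / (1 - x))

class AnchorRefineDecoder(nn.Module):
    """Anchors serve directly as queries and are refined at every layer."""
    def __init__(self, d_model=256, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, 8, batch_first=True)
            for _ in range(num_layers))
        self.anchor_embed = nn.Linear(4, d_model)  # stand-in for sinusoidal embedding
        self.box_head = nn.Linear(d_model, 4)      # per-layer anchor offsets

    def forward(self, tgt, memory, anchors):
        # tgt: (B, Q, d); memory: (B, HW, d); anchors: (B, Q, 4) in [0, 1].
        for layer in self.layers:
            tgt = layer(tgt + self.anchor_embed(anchors), memory)
            # Additive update in logit space keeps the boxes normalized.
            anchors = (inverse_sigmoid(anchors) + self.box_head(tgt)).sigmoid()
            anchors = anchors.detach()  # assumption: stop grads between layers
        return tgt, anchors

dec = AnchorRefineDecoder()
tgt, anchors = dec(torch.rand(2, 100, 256), torch.rand(2, 400, 256),
                   torch.rand(2, 100, 4))
print(anchors.shape)  # torch.Size([2, 100, 4])
```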
- Recurrent Glimpse-based Decoder for Detection with Transformer [85.64521612986456]
We introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper.
In particular, REGO employs a multi-stage recurrent processing structure to help the attention of DETR gradually focus on foreground objects (sketched after this entry).
REGO consistently boosts the performance of different DETR detectors by up to 7% relative gain at the same setting of 50 training epochs.
arXiv Detail & Related papers (2021-12-09T00:29:19Z)
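A minimal sketch of one glimpse stage under one plausible reading of the summary: features are re-extracted around the previous stage's boxes with `torchvision.ops.roi_align` (standing in for the glimpse feature extraction) and the previous decoder output attends to them before the boxes are re-predicted. The stage structure and all names are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class GlimpseStage(nn.Module):
    """One REGO-style stage: re-extract features around the previous boxes
    (the 'glimpse') and let the previous decoder output attend to them."""
    def __init__(self, d_model=256, glimpse=7):
        super().__init__()
        self.glimpse = glimpse
        self.attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.box_head = nn.Linear(d_model, 4)  # refined boxes from the fused output

    def forward(self, feat_map, boxes_xyxy, dec_out):
        # feat_map: (B, C, H, W); boxes_xyxy: list of B tensors (Q, 4) in pixels;
        # dec_out: (B, Q, C) decoder output from the previous stage.
        B, Q, C = dec_out.shape
        rois = roi_align(feat_map, boxes_xyxy, self.glimpse)  # (B*Q, C, g, g)
        rois = rois.flatten(2).transpose(1, 2)                # (B*Q, g*g, C)
        q = dec_out.reshape(B * Q, 1, C)
        out, _ = self.attn(q, rois, rois)  # attend inside the glimpse region
        fused = dec_out + out.reshape(B, Q, C)
        return fused, self.box_head(fused)

stage = GlimpseStage()
boxes = [torch.tensor([[8., 8., 40., 40.], [16., 16., 48., 48.]])] * 2
fused, refined = stage(torch.rand(2, 256, 64, 64), boxes, torch.rand(2, 2, 256))
print(fused.shape, refined.shape)  # torch.Size([2, 2, 256]) torch.Size([2, 2, 4])
```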
- Fast Convergence of DETR with Spatially Modulated Co-Attention [83.19863907905666]
We propose a simple yet effective scheme for improving the Detection Transformer framework, namely the Spatially Modulated Co-Attention (SMCA) mechanism.
Our proposed SMCA increases DETR's convergence speed by replacing the original co-attention mechanism in the decoder with a spatially modulated one (sketched after this entry).
Our fully-fledged SMCA can achieve better performance compared to DETR with a dilated convolution-based backbone.
arXiv Detail & Related papers (2021-08-05T06:53:19Z)
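A minimal sketch of the spatial modulation under the common reading: a Gaussian-like prior centered at each query's estimated object location is added to the cross-attention logits before the softmax, so each query concentrates near its prediction. The function name, the normalized grid, and the single isotropic scale per query are our simplifications.

```python
import torch

def spatially_modulated_attention(logits, centers, scales, h, w):
    """Add a log-Gaussian spatial prior to cross-attention logits so each
    query focuses near its predicted center before the softmax."""
    ys = torch.arange(h, dtype=torch.float32).view(h, 1).expand(h, w) / h
    xs = torch.arange(w, dtype=torch.float32).view(1, w).expand(h, w) / w
    grid = torch.stack((xs, ys), dim=-1).view(1, 1, h * w, 2)  # (1, 1, HW, 2)
    # Squared distance from every spatial location to each query's center.
    d2 = ((grid - centers.unsqueeze(2)) ** 2).sum(-1)          # (B, Q, HW)
    prior = -d2 / (2 * scales.unsqueeze(-1) ** 2)
    return (logits + prior).softmax(dim=-1)

# Two queries over an 8x8 map, centered at opposite corners.
logits = torch.zeros(1, 2, 64)
centers = torch.tensor([[[0.25, 0.25], [0.75, 0.75]]])  # normalized (x, y)
scales = torch.full((1, 2), 0.2)
print(spatially_modulated_attention(logits, centers, scales, 8, 8).shape)
# torch.Size([1, 2, 64])
```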
- UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [11.251593386108189]
We propose a novel pretext task named random query patch detection for Unsupervised Pre-training DETR (UP-DETR).
Specifically, we randomly crop patches from the given image and then feed them as queries to the decoder (sketched after this entry).
UP-DETR significantly boosts the performance of DETR with faster convergence and higher average precision on object detection, one-shot detection and panoptic segmentation.
arXiv Detail & Related papers (2020-11-18T05:16:11Z)
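A minimal sketch of the pretext-task data preparation, assuming a fixed patch size and crop positions shared across the batch for brevity; in the paper the patch features (from the backbone) are added to the decoder queries, which is omitted here. All names are our assumptions.

```python
import torch

def sample_query_patches(images, num_patches=10, patch_size=64):
    """Crop random patches and record where they came from: the patches
    become decoder queries, the recorded boxes the pretext-task targets."""
    B, C, H, W = images.shape
    patches, boxes = [], []
    for _ in range(num_patches):
        x = torch.randint(0, W - patch_size + 1, (1,)).item()
        y = torch.randint(0, H - patch_size + 1, (1,)).item()
        patches.append(images[:, :, y:y + patch_size, x:x + patch_size])
        boxes.append(torch.tensor([(x + patch_size / 2) / W,   # normalized cx
                                   (y + patch_size / 2) / H,   # normalized cy
                                   patch_size / W, patch_size / H]))
    return torch.stack(patches, dim=1), torch.stack(boxes)

patches, boxes = sample_query_patches(torch.rand(2, 3, 256, 256))
print(patches.shape, boxes.shape)
# torch.Size([2, 10, 3, 64, 64]) torch.Size([10, 4])
```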