DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR
- URL: http://arxiv.org/abs/2201.12329v1
- Date: Fri, 28 Jan 2022 18:51:09 GMT
- Title: DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR
- Authors: Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, Lei Zhang
- Abstract summary: We present a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer).
This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer-by-layer.
It achieves the best performance on the MS-COCO benchmark among DETR-like detection models under the same setting.
- Score: 37.61768722607528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present in this paper a novel query formulation using dynamic anchor boxes
for DETR (DEtection TRansformer) and offer a deeper understanding of the role
of queries in DETR. This new formulation directly uses box coordinates as
queries in Transformer decoders and dynamically updates them layer-by-layer.
Using box coordinates not only lets us exploit explicit positional priors to
improve the query-to-feature similarity and eliminate the slow training
convergence issue in DETR, but also allows us to modulate the positional
attention map using the box width and height information. Such a design makes
it clear that queries in DETR can be implemented as performing soft ROI pooling
layer-by-layer in a cascade manner. As a result, it leads to the best
performance on the MS-COCO benchmark among DETR-like detection models under the
same setting, e.g., 45.7% AP using ResNet50-DC5 as the backbone trained for 50
epochs. We also conducted extensive experiments to confirm our analysis and
verify the effectiveness of our methods. Code is available at
https://github.com/SlongLiu/DAB-DETR.
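To make the formulation above concrete, here is a minimal PyTorch-style sketch of using 4D anchor boxes as decoder queries: the box center drives a sinusoidal positional query, the width/height rescale it (one simple way to realize the modulated positional attention described above), and each layer refines the boxes. The helper modules (`layer`, `delta_head`) and the exact modulation are assumptions for illustration, not the authors' implementation.

```python
import torch


def sine_embed(x, dim=128, temperature=10000):
    # Standard sinusoidal embedding of a normalized coordinate in [0, 1].
    t = torch.arange(dim // 2, dtype=torch.float32)
    freqs = temperature ** (2 * t / dim)
    angles = x.unsqueeze(-1) * 2 * torch.pi / freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)


def inverse_sigmoid(p, eps=1e-5):
    p = p.clamp(eps, 1 - eps)
    return torch.log(p / (1 - p))


def decoder_sketch(anchors, content_q, decoder_layers, delta_heads):
    """anchors: (num_queries, 4) normalized (x, y, w, h) boxes used as queries.

    decoder_layers / delta_heads are hypothetical stand-ins for the decoder
    layers and the per-layer box-refinement heads.
    """
    for layer, delta_head in zip(decoder_layers, delta_heads):
        x, y, w, h = anchors.unbind(-1)
        # Positional query from the box center; dividing by width/height
        # spreads or sharpens the positional attention with the box size.
        pos_q = torch.cat([sine_embed(x) / w.unsqueeze(-1),
                           sine_embed(y) / h.unsqueeze(-1)], dim=-1)
        content_q = layer(content_q, pos_q)      # cross-attend to image features
        # Dynamic layer-by-layer update: refine the anchor box in logit space.
        delta = delta_head(content_q)            # (num_queries, 4)
        anchors = (inverse_sigmoid(anchors) + delta).sigmoid()
    return anchors, content_q
```

Read this way, each decoder layer gathers features from the region covered by the current box, which matches the soft ROI pooling view in the abstract.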
Related papers
- End-to-End Lane detection with One-to-Several Transformer [6.79236957488334]
O2SFormer converges 12.5x faster than DETR for the ResNet18 backbone.
O2SFormer with ResNet50 backbone achieves 77.83% F1 score on CULane dataset, outperforming existing Transformer-based and CNN-based detectors.
arXiv Detail & Related papers (2023-05-01T06:07:11Z)
- Detection Transformer with Stable Matching [48.963171068785435]
We show that the most important design is to use positional metrics, and only positional metrics, to supervise the classification scores of positive examples.
Based on this principle, we propose two simple yet effective modifications that integrate positional metrics into DETR's classification loss and matching cost (see the sketch after this entry).
We achieve 50.4 and 51.5 AP on the COCO detection benchmark using ResNet-50 backbones under 12 epochs and 24 epochs training settings.
arXiv Detail & Related papers (2023-04-10T17:55:37Z)
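A minimal sketch of how a positional metric can supervise classification scores, in the spirit of the entry above. The concrete choice here — using the IoU between a positive prediction and its matched ground-truth box as a soft classification target — and all function names are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou


def position_supervised_cls_loss(pred_logits, pred_boxes, gt_boxes,
                                 matched_gt_idx, pos_mask):
    """Supervise classification scores of positives with a positional metric (IoU).

    pred_logits:    (Q,)   logit of the matched class for each query
    pred_boxes:     (Q, 4) predicted boxes in (x1, y1, x2, y2) format
    gt_boxes:       (G, 4) ground-truth boxes in the same format
    matched_gt_idx: (Q,)   index of the matched ground-truth box per query
    pos_mask:       (Q,)   True for positive (matched) queries
    """
    # Positional metric: IoU between each prediction and its matched ground truth.
    iou = box_iou(pred_boxes, gt_boxes)[torch.arange(len(pred_boxes)), matched_gt_idx]
    # Positives are pulled toward their IoU instead of a hard 1; negatives stay at 0.
    targets = torch.where(pos_mask, iou.detach(), torch.zeros_like(iou))
    return F.binary_cross_entropy_with_logits(pred_logits, targets)
```

The same IoU term can also enter the Hungarian matching cost, keeping matching and supervision consistent, which corresponds to the second modification the entry mentions.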
- Conditional DETR V2: Efficient Detection Transformer with Box Queries [58.9706842210695]
We are interested in an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing, such as NMS.
Inspired by Conditional DETR, an improved DETR with fast training convergence, we reformulate the object query as a box query composed of the embeddings of a reference point.
We learn the box queries from the image content, further improving the detection quality of Conditional DETR while retaining fast training convergence.
arXiv Detail & Related papers (2022-07-18T20:08:55Z)
- Recurrent Glimpse-based Decoder for Detection with Transformer [85.64521612986456]
We introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper.
In particular, REGO employs a multi-stage recurrent processing structure to help DETR's attention gradually focus on foreground objects.
REGO consistently boosts the performance of different DETR detectors by up to 7% relative gain at the same setting of 50 training epochs.
arXiv Detail & Related papers (2021-12-09T00:29:19Z)
- Anchor DETR: Query Design for Transformer-Based Detector [24.925317590675203]
We propose a novel query design for transformer-based detectors.
Object queries are based on anchor points, which are widely used in CNN-based detectors.
Our design can predict multiple objects at one position, addressing the "one region, multiple objects" difficulty (see the sketch after this entry).
arXiv Detail & Related papers (2021-09-15T06:31:55Z)
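As a rough illustration of the query design sketched in the entry above, the snippet below builds decoder queries from learnable 2D anchor points and adds a few "pattern" embeddings per point so one position can emit several predictions. The module layout, dimensions, and the linear point encoding are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class AnchorPointQueries(nn.Module):
    """Decoder queries derived from anchor points, with several patterns per point."""

    def __init__(self, num_points=300, num_patterns=3, dim=256):
        super().__init__()
        # Learnable 2D anchor points with normalized (x, y) coordinates.
        self.points = nn.Parameter(torch.rand(num_points, 2))
        self.point_proj = nn.Linear(2, dim)
        # Pattern embeddings let one anchor point carry multiple object queries,
        # covering the "one region, multiple objects" case.
        self.patterns = nn.Embedding(num_patterns, dim)

    def forward(self):
        point_embed = self.point_proj(self.points)                  # (P, dim)
        queries = point_embed[:, None, :] + self.patterns.weight[None, :, :]
        return queries.flatten(0, 1), self.points                   # (P * patterns, dim)
```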
- Conditional DETR for Fast Training Convergence [76.95358216461524]
We present a conditional cross-attention mechanism for fast DETR training.
Our approach is motivated by the observation that the cross-attention in DETR relies heavily on the content embeddings for localizing the four extremities.
We show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for stronger backbones DC5-R50 and DC5-R101.
arXiv Detail & Related papers (2021-08-13T10:07:46Z)
- Fast Convergence of DETR with Spatially Modulated Co-Attention [83.19863907905666]
We propose a simple yet effective scheme for improving the Detection Transformer framework, namely Spatially Modulated Co-Attention (SMCA) mechanism.
Our proposed SMCA increases DETR's convergence speed by replacing the original co-attention mechanism in the decoder with the SMCA module (a rough sketch follows this entry).
Our fully-fledged SMCA can achieve better performance compared to DETR with a dilated convolution-based backbone.
arXiv Detail & Related papers (2021-08-05T06:53:19Z)
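The entry above names the mechanism without spelling it out; as a hedged illustration, one way to spatially modulate cross-attention is to bias the attention logits with a Gaussian-like prior centered on a per-query estimate of the object's center and scale. The names and shapes below are assumptions.

```python
import torch


def spatially_modulated_attention(attn_logits, centers, scales, coords):
    """Bias cross-attention logits with a Gaussian-like spatial prior.

    attn_logits: (Q, N) raw query-to-feature attention logits
    centers:     (Q, 2) per-query estimated object centers (cx, cy), normalized
    scales:      (Q, 2) per-query estimated widths/heights, normalized
    coords:      (N, 2) normalized (x, y) position of each feature location
    """
    # Squared distance of every feature location to every query's center,
    # scaled by the query's estimated object size.
    diff = coords[None, :, :] - centers[:, None, :]            # (Q, N, 2)
    dist = (diff / scales[:, None, :]).pow(2).sum(-1)          # (Q, N)
    # Smaller penalty near the estimated box; normalize with softmax as usual.
    return (attn_logits - dist).softmax(dim=-1)
```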
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.