Conditional DETR V2: Efficient Detection Transformer with Box Queries
- URL: http://arxiv.org/abs/2207.08914v1
- Date: Mon, 18 Jul 2022 20:08:55 GMT
- Title: Conditional DETR V2: Efficient Detection Transformer with Box Queries
- Authors: Xiaokang Chen, Fangyun Wei, Gang Zeng, Jingdong Wang
- Abstract summary: We are interested in an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing, such as NMS.
Inspired by Conditional DETR, an improved DETR with fast training convergence, we reformulate the object query into the format of the box query, a composition of the embeddings of the reference point and of the transformation of the box with respect to the reference point.
We learn the box queries from the image content, further improving the detection quality of Conditional DETR while retaining fast training convergence.
- Score: 58.9706842210695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we are interested in Detection Transformer (DETR), an
end-to-end object detection approach based on a transformer encoder-decoder
architecture without hand-crafted postprocessing, such as NMS. Inspired by
Conditional DETR, an improved DETR with fast training convergence, which
presented box queries (originally called spatial queries) for internal decoder
layers, we reformulate the object query into the format of the box query that
is a composition of the embeddings of the reference point and the
transformation of the box with respect to the reference point. This
reformulation indicates the connection between the object query in DETR and the
anchor box that is widely studied in Faster R-CNN. Furthermore, we learn the
box queries from the image content, further improving the detection quality of
Conditional DETR while retaining fast training convergence. In addition, we adopt
the idea of axial self-attention to reduce the memory cost and accelerate the
encoder. The resulting detector, called Conditional DETR V2, achieves better
results than Conditional DETR, reduces the memory cost, and runs more efficiently.
For example, with the DC$5$-ResNet-$50$ backbone, our approach achieves $44.8$
AP at $16.4$ FPS on the COCO $val$ set; compared to Conditional DETR, it
runs $1.6\times$ faster, saves $74$\% of the overall memory cost, and improves
the AP score by $1.0$.
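To make the box-query formulation above concrete, below is a minimal, hypothetical PyTorch sketch of composing a query from the sinusoidal embedding of a reference point and a learned transformation derived from the box extent, with both predicted from image content. All names (sine_embed, BoxQuery, the linear heads) and the element-wise composition are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): a "box query" formed by
# composing the sinusoidal embedding of a reference point with a learned
# transformation conditioned on the predicted box extent.
import math
import torch
import torch.nn as nn


def sine_embed(xy: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Sinusoidal embedding of normalized (x, y) coordinates -> (N, dim)."""
    half = dim // 2
    freqs = 10000 ** (torch.arange(half // 2, dtype=torch.float32) * 2 / half)

    def embed_1d(coord):                              # (N,) -> (N, half)
        pos = coord[:, None] * 2 * math.pi / freqs
        return torch.cat([pos.sin(), pos.cos()], dim=-1)

    return torch.cat([embed_1d(xy[:, 0]), embed_1d(xy[:, 1])], dim=-1)


class BoxQuery(nn.Module):
    """Compose a box query from a reference point and a box transformation."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Predict reference points and box extents from image content
        # (e.g., pooled encoder features); plain linear heads here.
        self.ref_head = nn.Linear(dim, 2)     # normalized (x, y)
        self.size_head = nn.Linear(dim, 2)    # normalized (w, h)
        self.transform = nn.Linear(dim, dim)  # maps extent info to a per-dim scale

    def forward(self, content: torch.Tensor) -> torch.Tensor:
        ref = self.ref_head(content).sigmoid()        # (N, 2) reference points
        size = self.size_head(content).sigmoid()      # (N, 2) box extents
        pos = sine_embed(ref, content.shape[-1])      # embed the reference point
        scale = self.transform(sine_embed(size, content.shape[-1]))
        return pos * scale                            # box query = composition


if __name__ == "__main__":
    content = torch.randn(100, 256)    # e.g., 100 content queries
    queries = BoxQuery(256)(content)
    print(queries.shape)               # torch.Size([100, 256])
```

In this sketch the element-wise product merely stands in for the composition of the two embeddings; the paper's actual composition, prediction heads, and the axial self-attention encoder are not reproduced here.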
Related papers
- SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency [40.04140037952051]
DETR-based approaches apply a central-concept spatial prior to accelerate Transformer detector convergence.
We propose SAlient Point-based DETR (SAP-DETR) by treating object detection as a transformation from salient points to instance objects.
Our experiments demonstrate that SAP-DETR achieves 1.4 times the convergence speed with competitive performance.
arXiv Detail & Related papers (2022-11-03T17:20:55Z)
- Pair DETR: Contrastive Learning Speeds Up DETR Training [0.6491645162078056]
We present a simple approach to address the main problem of DETR, the slow convergence.
We detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders.
Experiments show that Pair DETR can converge at least 10x faster than original DETR and 1.5x faster than Conditional DETR during training.
arXiv Detail & Related papers (2022-10-29T03:02:49Z)
- ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers [73.29057814695459]
ComplETR is designed to explicitly complete missing annotations in partially annotated dense scene datasets.
This reduces the need to annotate every object instance in the scene, thereby reducing annotation cost.
We show performance improvement for several popular detectors such as Faster R-CNN, Cascade R-CNN, CenterNet2, and Deformable DETR.
arXiv Detail & Related papers (2022-09-13T00:11:16Z)
- Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z)
- DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [37.61768722607528]
We present a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer).
This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer-by-layer; a minimal sketch of this idea appears after this list.
It leads to the best performance on the MS-COCO benchmark among the DETR-like detection models under the same setting.
arXiv Detail & Related papers (2022-01-28T18:51:09Z)
- Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity [10.098578160958946]
We show that Sparse DETR achieves better performance than Deformable DETR even with only 10% encoder tokens on the COCO dataset.
Although only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR.
arXiv Detail & Related papers (2021-11-29T05:22:46Z)
- Conditional DETR for Fast Training Convergence [76.95358216461524]
We present a conditional cross-attention mechanism for fast DETR training.
Our approach is motivated by the observation that the cross-attention in DETR relies highly on the content embeddings for localizing the four extremities.
We show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for stronger backbones DC5-R50 and DC5-R101.
arXiv Detail & Related papers (2021-08-13T10:07:46Z)
- Fast Convergence of DETR with Spatially Modulated Co-Attention [83.19863907905666]
We propose a simple yet effective scheme for improving the Detection Transformer framework, namely the Spatially Modulated Co-Attention (SMCA) mechanism.
Our proposed SMCA increases DETR's convergence speed by replacing the original co-attention mechanism in the decoder.
Our fully-fledged SMCA can achieve better performance compared to DETR with a dilated convolution-based backbone.
arXiv Detail & Related papers (2021-08-05T06:53:19Z)
- Oriented Object Detection with Transformer [51.634913687632604]
We implement Oriented Object DEtection with TRansformer ($\bf O^2DETR$) based on an end-to-end network.
We design a simple but highly efficient encoder for Transformer by replacing the attention mechanism with depthwise separable convolution.
Our $\rm O^2DETR$ can serve as a new benchmark in the field of oriented object detection, achieving up to a 3.85 mAP improvement over Faster R-CNN and RetinaNet.
arXiv Detail & Related papers (2021-06-06T14:57:17Z)
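For contrast with the box-query sketch above, the DAB-DETR entry in this list describes using 4D anchor boxes directly as decoder queries that are refined layer by layer. The sketch below illustrates that idea under simplifying assumptions (standard nn.TransformerDecoderLayer blocks, a single refinement head, hypothetical names such as AnchorBoxDecoder and refine_head); it is not the DAB-DETR implementation.

```python
# Hypothetical sketch (not the DAB-DETR code): 4D anchor boxes (x, y, w, h)
# used directly as decoder queries and updated layer by layer.
import torch
import torch.nn as nn


class AnchorBoxDecoder(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 6, num_queries: int = 300):
        super().__init__()
        # Learnable anchor boxes in normalized (x, y, w, h) coordinates.
        self.anchors = nn.Parameter(torch.rand(num_queries, 4))
        self.layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
             for _ in range(num_layers)]
        )
        self.box_embed = nn.Linear(4, dim)    # anchor box -> query embedding
        self.refine_head = nn.Linear(dim, 4)  # per-layer box offsets

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        """memory: (B, HW, dim) encoder features -> (B, num_queries, 4) boxes."""
        bsz = memory.size(0)
        boxes = self.anchors.unsqueeze(0).expand(bsz, -1, -1)
        for layer in self.layers:
            queries = self.box_embed(boxes)   # box coordinates act as the queries
            hidden = layer(queries, memory)   # cross-attend to the image features
            # Refine the boxes layer by layer with predicted offsets (logit space).
            boxes = (torch.logit(boxes, eps=1e-5) + self.refine_head(hidden)).sigmoid()
        return boxes


if __name__ == "__main__":
    feats = torch.randn(2, 1024, 256)              # dummy encoder output
    print(AnchorBoxDecoder()(feats).shape)         # torch.Size([2, 300, 4])
```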