Conditional DETR V2: Efficient Detection Transformer with Box Queries
- URL: http://arxiv.org/abs/2207.08914v1
- Date: Mon, 18 Jul 2022 20:08:55 GMT
- Title: Conditional DETR V2: Efficient Detection Transformer with Box Queries
- Authors: Xiaokang Chen, Fangyun Wei, Gang Zeng, Jingdong Wang
- Abstract summary: We are interested in an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing, such as NMS.
Inspired by Conditional DETR, an improved DETR with fast training convergence, we reformulate the object query into the format of the box query, a composition of the embeddings of the reference point and of the transformation of the box with respect to the reference point.
We learn the box queries from the image content, further improving the detection quality of Conditional DETR while retaining fast training convergence.
- Score: 58.9706842210695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we are interested in Detection Transformer (DETR), an
end-to-end object detection approach based on a transformer encoder-decoder
architecture without hand-crafted postprocessing, such as NMS. Inspired by
Conditional DETR, an improved DETR with fast training convergence, which
presented box queries (originally called spatial queries) for internal decoder
layers, we reformulate the object query into the format of the box query that
is a composition of the embeddings of the reference point and the
transformation of the box with respect to the reference point. This
reformulation indicates the connection between the object query in DETR and the
anchor box that is widely studied in Faster R-CNN. Furthermore, we learn the
box queries from the image content, further improving the detection quality of
Conditional DETR while retaining fast training convergence. In addition, we adopt
the idea of axial self-attention to reduce the memory cost and accelerate the
encoder. The resulting detector, called Conditional DETR V2, achieves better
results than Conditional DETR, reduces the memory cost, and runs more efficiently.
For example, with the DC$5$-ResNet-$50$ backbone, our approach achieves $44.8$
AP at $16.4$ FPS on the COCO $val$ set; compared to Conditional DETR, it
runs $1.6\times$ faster, saves $74$\% of the overall memory cost, and improves
the AP score by $1.0$.
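To make the box-query formulation above concrete, below is a minimal, hypothetical PyTorch sketch of composing a query from the sinusoidal embedding of a reference point and a learned transformation derived from the box extent, with both predicted from image content. All names (sine_embed, BoxQuery, the linear heads) and the element-wise composition are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): a "box query" formed by
# composing the sinusoidal embedding of a reference point with a learned
# transformation conditioned on the predicted box extent.
import math
import torch
import torch.nn as nn


def sine_embed(xy: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Sinusoidal embedding of normalized (x, y) coordinates -> (N, dim)."""
    half = dim // 2
    freqs = 10000 ** (torch.arange(half // 2, dtype=torch.float32) * 2 / half)

    def embed_1d(coord):                              # (N,) -> (N, half)
        pos = coord[:, None] * 2 * math.pi / freqs
        return torch.cat([pos.sin(), pos.cos()], dim=-1)

    return torch.cat([embed_1d(xy[:, 0]), embed_1d(xy[:, 1])], dim=-1)


class BoxQuery(nn.Module):
    """Compose a box query from a reference point and a box transformation."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Predict reference points and box extents from image content
        # (e.g., pooled encoder features); plain linear heads here.
        self.ref_head = nn.Linear(dim, 2)     # normalized (x, y)
        self.size_head = nn.Linear(dim, 2)    # normalized (w, h)
        self.transform = nn.Linear(dim, dim)  # maps extent info to a per-dim scale

    def forward(self, content: torch.Tensor) -> torch.Tensor:
        ref = self.ref_head(content).sigmoid()        # (N, 2) reference points
        size = self.size_head(content).sigmoid()      # (N, 2) box extents
        pos = sine_embed(ref, content.shape[-1])      # embed the reference point
        scale = self.transform(sine_embed(size, content.shape[-1]))
        return pos * scale                            # box query = composition


if __name__ == "__main__":
    content = torch.randn(100, 256)    # e.g., 100 content queries
    queries = BoxQuery(256)(content)
    print(queries.shape)               # torch.Size([100, 256])
```

In this sketch the element-wise product merely stands in for the composition of the two embeddings; the paper's actual composition, prediction heads, and the axial self-attention encoder are not reproduced here.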
Related papers
- SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency [40.04140037952051]
DETR-based approaches apply a central-concept spatial prior to accelerate Transformer detector convergence.
We propose SAlient Point-based DETR (SAP-DETR) by treating object detection as a transformation from salient points to instance objects.
Our experiments demonstrate that SAP-DETR achieves 1.4 times the convergence speed with competitive performance.
arXiv Detail & Related papers (2022-11-03T17:20:55Z)
- Pair DETR: Contrastive Learning Speeds Up DETR Training [0.6491645162078056]
We present a simple approach to address the main problem of DETR, the slow convergence.
We detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders.
Experiments show that Pair DETR can converge at least 10x faster than original DETR and 1.5x faster than Conditional DETR during training.
arXiv Detail & Related papers (2022-10-29T03:02:49Z)
- ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers [73.29057814695459]
ComplETR is designed to explicitly complete missing annotations in partially annotated dense scene datasets.
This reduces the need to annotate every object instance in the scene, thereby reducing annotation cost.
We show performance improvement for several popular detectors such as Faster R-CNN, Cascade R-CNN, CenterNet2, and Deformable DETR.
arXiv Detail & Related papers (2022-09-13T00:11:16Z)
- Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z)
- DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR [37.61768722607528]
We present a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer).
This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer-by-layer; a minimal sketch of this idea appears after this list.
It leads to the best performance on the MS-COCO benchmark among the DETR-like detection models under the same setting.
arXiv Detail & Related papers (2022-01-28T18:51:09Z)
- Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity [10.098578160958946]
We show that Sparse DETR achieves better performance than Deformable DETR even with only 10% encoder tokens on the COCO dataset.
Although only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR.
arXiv Detail & Related papers (2021-11-29T05:22:46Z)
- Conditional DETR for Fast Training Convergence [76.95358216461524]
We present a conditional cross-attention mechanism for fast DETR training.
Our approach is motivated by the observation that the cross-attention in DETR relies highly on the content embeddings for localizing the four extremities.
We show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for stronger backbones DC5-R50 and DC5-R101.
arXiv Detail & Related papers (2021-08-13T10:07:46Z)
- Fast Convergence of DETR with Spatially Modulated Co-Attention [83.19863907905666]
We propose a simple yet effective scheme for improving the Detection Transformer framework, namely the Spatially Modulated Co-Attention (SMCA) mechanism.
Our proposed SMCA increases DETR's convergence speed by replacing the original co-attention mechanism in the decoder.
Our fully-fledged SMCA can achieve better performance compared to DETR with a dilated convolution-based backbone.
arXiv Detail & Related papers (2021-08-05T06:53:19Z)
- Oriented Object Detection with Transformer [51.634913687632604]
We implement Oriented Object DEtection with TRansformer ($\bf O^2DETR$) based on an end-to-end network.
We design a simple but highly efficient encoder for Transformer by replacing the attention mechanism with depthwise separable convolution.
Our $\rm O^2DETR$ can serve as a new benchmark in the field of oriented object detection, achieving up to a 3.85 mAP improvement over Faster R-CNN and RetinaNet.
arXiv Detail & Related papers (2021-06-06T14:57:17Z)
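For contrast with the box-query sketch above, the DAB-DETR entry in this list describes using 4D anchor boxes directly as decoder queries that are refined layer by layer. The sketch below illustrates that idea under simplifying assumptions (standard nn.TransformerDecoderLayer blocks, a single refinement head, hypothetical names such as AnchorBoxDecoder and refine_head); it is not the DAB-DETR implementation.

```python
# Hypothetical sketch (not the DAB-DETR code): 4D anchor boxes (x, y, w, h)
# used directly as decoder queries and updated layer by layer.
import torch
import torch.nn as nn


class AnchorBoxDecoder(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 6, num_queries: int = 300):
        super().__init__()
        # Learnable anchor boxes in normalized (x, y, w, h) coordinates.
        self.anchors = nn.Parameter(torch.rand(num_queries, 4))
        self.layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
             for _ in range(num_layers)]
        )
        self.box_embed = nn.Linear(4, dim)    # anchor box -> query embedding
        self.refine_head = nn.Linear(dim, 4)  # per-layer box offsets

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        """memory: (B, HW, dim) encoder features -> (B, num_queries, 4) boxes."""
        bsz = memory.size(0)
        boxes = self.anchors.unsqueeze(0).expand(bsz, -1, -1)
        for layer in self.layers:
            queries = self.box_embed(boxes)   # box coordinates act as the queries
            hidden = layer(queries, memory)   # cross-attend to the image features
            # Refine the boxes layer by layer with predicted offsets (logit space).
            boxes = (torch.logit(boxes, eps=1e-5) + self.refine_head(hidden)).sigmoid()
        return boxes


if __name__ == "__main__":
    feats = torch.randn(2, 1024, 256)              # dummy encoder output
    print(AnchorBoxDecoder()(feats).shape)         # torch.Size([2, 300, 4])
```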