Anchor DETR: Query Design for Transformer-Based Detector
- URL: http://arxiv.org/abs/2109.07107v1
- Date: Wed, 15 Sep 2021 06:31:55 GMT
- Title: Anchor DETR: Query Design for Transformer-Based Detector
- Authors: Yingming Wang, Xiangyu Zhang, Tong Yang, Jian Sun
- Abstract summary: We propose a novel query design for the transformer-based detectors.
Object queries are based on anchor points, which are widely used in CNN-based detectors.
Our design can predict multiple objects at one position to address the "one region, multiple objects" difficulty.
- Score: 24.925317590675203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel query design for transformer-based detectors. In previous transformer-based detectors, the object queries are a set of learned embeddings. However, each learned embedding has no explicit physical meaning, and we cannot explain where it will focus. The queries are also difficult to optimize, as the prediction slot of each object query does not have a specific mode; in other words, each object query will not focus on a specific region. To solve these problems, our query design bases the object queries on anchor points, which are widely used in CNN-based detectors, so that each object query focuses on the objects near its anchor point. Moreover, our query design can predict multiple objects at one position, resolving the difficulty of "one region, multiple objects". In addition, we design an attention variant that reduces memory cost while achieving similar or better performance than the standard attention in DETR. Thanks to the query design and the attention variant, the proposed detector, which we call Anchor DETR, achieves better performance and runs faster than DETR with 10$\times$ fewer training epochs. For example, it achieves 44.2 AP at 16 FPS on the MSCOCO dataset when trained for 50 epochs with the ResNet50-DC5 feature. Extensive experiments on the MSCOCO benchmark prove the effectiveness of the proposed methods. Code is available at https://github.com/megvii-model/AnchorDETR.
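To make the query design concrete, below is a minimal PyTorch sketch of how object queries could be built from learnable anchor points, with several "pattern" embeddings per point so that one position can yield multiple predictions. This is an illustrative sketch based only on the abstract, not the official implementation (see the repository above); the sine positional encoding, the module name AnchorQueries, and the hyper-parameters num_anchors and num_patterns are assumptions.

```python
import math
import torch
import torch.nn as nn


def sine_encode_points(points: torch.Tensor, num_feats: int = 128,
                       temperature: float = 10000.0) -> torch.Tensor:
    """Encode normalized 2-D anchor points (x, y) in [0, 1] into
    sine/cosine embeddings of size 2 * num_feats (DETR-style encoding)."""
    dim_t = torch.arange(num_feats, dtype=torch.float32, device=points.device)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    pos = points * 2 * math.pi                 # (N, 2)
    pos = pos[..., None] / dim_t               # (N, 2, num_feats)
    pos = torch.stack((pos[..., 0::2].sin(),
                       pos[..., 1::2].cos()), dim=-1)
    return pos.flatten(1)                      # (N, 2 * num_feats)


class AnchorQueries(nn.Module):
    """Illustrative module (hypothetical name): object queries derived from
    learnable anchor points, replicated with `num_patterns` pattern
    embeddings so that one position can predict multiple objects."""

    def __init__(self, num_anchors: int = 300, num_patterns: int = 3,
                 embed_dim: int = 256):
        super().__init__()
        # Learnable 2-D anchor points, initialized uniformly in [0, 1]^2.
        self.anchors = nn.Parameter(torch.rand(num_anchors, 2))
        # Pattern embeddings shared by all anchor positions.
        self.patterns = nn.Embedding(num_patterns, embed_dim)
        self.embed_dim = embed_dim

    def forward(self) -> torch.Tensor:
        # Positional part of the query: encode each anchor point so the
        # query attends to image features near that point.
        pos = sine_encode_points(self.anchors, num_feats=self.embed_dim // 2)
        # Combine every anchor encoding with every pattern embedding.
        queries = pos[:, None, :] + self.patterns.weight[None, :, :]
        # Flatten to (num_anchors * num_patterns, embed_dim) decoder queries.
        return queries.reshape(-1, self.embed_dim)


# Example: 300 anchors x 3 patterns -> 900 object queries of dimension 256.
queries = AnchorQueries()()
print(queries.shape)  # torch.Size([900, 256])
```

The memory-efficient attention variant mentioned in the abstract is a separate component and is not sketched here; the official repository contains the actual implementation of both parts.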
Related papers
- Efficient Decoder for End-to-End Oriented Object Detection in Remote Sensing Images [26.37802649901314]
We propose an end-to-end oriented detector equipped with an efficient decoder.
Rotated RoI attention and Selective Distinct Queries (SDQ) are proposed.
Our method achieves state-of-the-art performance on DIOR-R (67.31% mAP), DOTA-v1.5 (67.43% mAP), and DOTA-v2.0 (53.28% mAP) with the ResNet50 backbone.
arXiv Detail & Related papers (2023-11-29T13:43:17Z)
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- StageInteractor: Query-based Object Detector with Cross-stage Interaction [21.84964476813102]
We propose a new query-based object detector with cross-stage interaction, coined as StageInteractor.
Our model improves the baseline by 2.2 AP, and achieves 44.8 AP with ResNet-50 as backbone.
With longer training time and 300 queries, StageInteractor achieves 51.1 AP and 52.2 AP with ResNeXt-101-DCN and Swin-S, respectively.
arXiv Detail & Related papers (2023-04-11T04:50:13Z)
- Dense Distinct Query for End-to-End Object Detection [39.32011383066249]
One-to-one assignment in object detection has successfully obviated the need for non-maximum suppression.
This paper shows that the solution should be Dense Distinct Queries (DDQ).
DDQ blends the advantages of traditional and recent end-to-end detectors and significantly improves the performance of various detectors.
arXiv Detail & Related papers (2023-03-22T17:42:22Z)
- ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers [73.29057814695459]
ComplETR is designed to explicitly complete missing annotations in partially annotated dense scene datasets.
This reduces the need to annotate every object instance in the scene thereby reducing annotation cost.
We show performance improvement for several popular detectors such as Faster R-CNN, Cascade R-CNN, CenterNet2, and Deformable DETR.
arXiv Detail & Related papers (2022-09-13T00:11:16Z)
- Robust Change Detection Based on Neural Descriptor Fields [53.111397800478294]
We develop an object-level online change detection approach that is robust to partially overlapping observations and noisy localization results.
By associating objects via shape code similarity and comparing local object-neighbor spatial layout, our proposed approach demonstrates robustness to low observation overlap and localization noises.
arXiv Detail & Related papers (2022-08-01T17:45:36Z)
- AdaMixer: A Fast-Converging Query-Based Object Detector [32.159871347459166]
We propose a fast-converging query-based object detector named AdaMixer.
AdaMixer has architectural simplicity without requiring explicit pyramid networks.
Our work sheds light on a simple, accurate, and fast converging architecture for query-based object detectors.
arXiv Detail & Related papers (2022-03-30T17:45:02Z)
- Robust Object Detection via Instance-Level Temporal Cycle Confusion [89.1027433760578]
We study the effectiveness of auxiliary self-supervised tasks to improve the out-of-distribution generalization of object detectors.
Inspired by the principle of maximum entropy, we introduce a novel self-supervised task, instance-level temporal cycle confusion (CycConf).
For each object, the task is to find the most different object proposals in the adjacent frame in a video and then cycle back to itself for self-supervision.
arXiv Detail & Related papers (2021-04-16T21:35:08Z)
- Learning Spatio-Temporal Transformer for Visual Tracking [108.11680070733598]
We present a new tracking architecture with an encoder-decoder transformer as the key component.
The whole method is end-to-end and does not need any postprocessing steps such as cosine windowing or bounding box smoothing.
The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks while running at real-time speed, 6x faster than Siam R-CNN.
arXiv Detail & Related papers (2021-03-31T15:19:19Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)