End-to-End Object Detection with Transformers
- URL: http://arxiv.org/abs/2005.12872v3
- Date: Thu, 28 May 2020 17:37:23 GMT
- Title: End-to-End Object Detection with Transformers
- Authors: Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier,
Alexander Kirillov, Sergey Zagoruyko
- Abstract summary: We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
- Score: 88.06357745922716
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We present a new method that views object detection as a direct set
prediction problem. Our approach streamlines the detection pipeline,
effectively removing the need for many hand-designed components like a
non-maximum suppression procedure or anchor generation that explicitly encode
our prior knowledge about the task. The main ingredients of the new framework,
called DEtection TRansformer or DETR, are a set-based global loss that forces
unique predictions via bipartite matching, and a transformer encoder-decoder
architecture. Given a fixed small set of learned object queries, DETR reasons
about the relations of the objects and the global image context to directly
output the final set of predictions in parallel. The new model is conceptually
simple and does not require a specialized library, unlike many other modern
detectors. DETR demonstrates accuracy and run-time performance on par with the
well-established and highly-optimized Faster RCNN baseline on the challenging
COCO object detection dataset. Moreover, DETR can be easily generalized to
produce panoptic segmentation in a unified manner. We show that it
significantly outperforms competitive baselines. Training code and pretrained
models are available at https://github.com/facebookresearch/detr.
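The bipartite matching is what makes the set loss work: each ground-truth object is assigned to exactly one prediction via the Hungarian algorithm, and every unmatched query is trained to predict a "no object" class. A minimal sketch of the matching step, with simplified cost terms and assumed weights (the paper combines class probability with L1 and generalized-IoU box costs), could look like:

```python
# Minimal sketch of DETR-style bipartite matching between predictions and
# ground truth. Cost terms are simplified assumptions; the paper combines
# class probability with L1 and generalized-IoU box costs.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: [Q, C+1], pred_boxes: [Q, 4] (cxcywh, normalized),
    gt_labels: [G], gt_boxes: [G, 4]. Returns matched (pred_idx, gt_idx)."""
    prob = pred_logits.softmax(-1)                      # [Q, C+1]
    cost_class = -prob[:, gt_labels]                    # [Q, G] class cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # [Q, G] L1 box cost
    cost = cost_class + 5.0 * cost_bbox                 # weights are assumptions
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)

# Usage: queries not selected by the matching are trained to predict the
# extra "no object" class, which is what forces unique predictions.
Q, G, C = 100, 3, 91
match = hungarian_match(torch.randn(Q, C + 1), torch.rand(Q, 4),
                        torch.randint(0, C, (G,)), torch.rand(G, 4))
```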
Related papers
- A Modern Take on Visual Relationship Reasoning for Grasp Planning [10.543168383800532]
We present a modern take on visual relational reasoning for grasp planning.
We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories.
We also propose D3G, a new end-to-end transformer-based dependency graph generation model.
arXiv Detail & Related papers (2024-09-03T16:30:48Z)
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, distinguishing the positive query from other highly similar queries that are not the best match poses a challenge for the network.
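As a rough illustration of the spatial-temporal idea, the sketch below applies plain multi-head attention over object queries twice: across objects within each frame, then across frames for each query slot. This is a generic stand-in under assumed shapes, not STEMD's actual graph attention design.

```python
# Generic spatial-temporal attention over object queries: attend across
# objects within each frame, then across frames for each query slot.
# An illustrative sketch only, not STEMD's actual architecture.
import torch
import torch.nn as nn

class SpatialTemporalQueryAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, q):                      # q: [T, N, D] frames x queries
        s, _ = self.spatial(q, q, q)           # inter-object, per frame
        q = self.norm1(q + s)
        t = q.transpose(0, 1)                  # [N, T, D] per-slot sequences
        t2, _ = self.temporal(t, t, t)         # temporal, per query slot
        return self.norm2(q + t2.transpose(0, 1))

# Usage: 4 frames, 100 queries, 256-d embeddings
out = SpatialTemporalQueryAttention()(torch.randn(4, 100, 256))
```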
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- Single-Stage Visual Relationship Learning using Conditional Queries [60.90880759475021]
TraCQ is a new formulation for scene graph generation that avoids the multi-task learning problem and the entity pair distribution.
We employ a DETR-based encoder-decoder design with conditional queries to significantly reduce the entity label space as well.
Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods, it also beats many state-of-the-art two-stage methods on the Visual Genome dataset.
arXiv Detail & Related papers (2023-06-09T06:02:01Z)
- ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers [73.29057814695459]
ComplETR is designed to explicitly complete missing annotations in partially annotated dense scene datasets.
This reduces the need to annotate every object instance in the scene, thereby reducing annotation cost.
We show performance improvement for several popular detectors such as Faster R-CNN, Cascade R-CNN, CenterNet2, and Deformable DETR.
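The completion step can be pictured as pseudo-labeling: keep a detector's confident predictions that do not overlap any existing ground-truth box and append them as annotations. The sketch below is an illustrative reconstruction with assumed thresholds and an assumed IoU filter, not ComplETR's actual procedure.

```python
# Illustrative sketch of annotation completion: keep confident detections
# that do not overlap existing ground truth and add them as pseudo-labels.
# Thresholds and the IoU filter are assumptions, not ComplETR's procedure.
import torch
from torchvision.ops import box_iou

def complete_annotations(pred_boxes, pred_scores, pred_labels,
                         gt_boxes, score_thr=0.7, iou_thr=0.5):
    """Boxes are [N, 4] xyxy. Returns pseudo-label boxes/labels to append."""
    keep = pred_scores > score_thr                   # confident predictions
    boxes, labels = pred_boxes[keep], pred_labels[keep]
    if len(gt_boxes) and len(boxes):
        iou = box_iou(boxes, gt_boxes)               # [K, G] pairwise IoU
        novel = iou.max(dim=1).values < iou_thr      # not already annotated
        boxes, labels = boxes[novel], labels[novel]
    return boxes, labels
```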
arXiv Detail & Related papers (2022-09-13T00:11:16Z)
- Task Specific Attention is one more thing you need for object detection [0.0]
We propose that combining several attention modules with our new Task Specific Split Transformer (TSST) is an effective way to produce the best COCO results.
arXiv Detail & Related papers (2022-02-18T07:09:33Z)
- Oriented Object Detection with Transformer [51.634913687632604]
We implement Oriented Object DEtection with TRansformer (O2DETR) based on an end-to-end network.
We design a simple but highly efficient encoder for the Transformer by replacing the attention mechanism with depthwise separable convolution.
Our O2DETR can serve as a new benchmark in the field of oriented object detection, achieving up to 3.85 mAP improvement over Faster R-CNN and RetinaNet.
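The encoder swap is concrete enough to sketch: a depthwise convolution (one filter per channel) followed by a pointwise 1x1 convolution replaces self-attention over the feature map. The block below is a generic depthwise separable layer under assumed channel and kernel sizes, not O2DETR's exact encoder.

```python
# Sketch of an encoder block that swaps self-attention for depthwise
# separable convolution (a depthwise conv followed by a pointwise conv).
# Channel and kernel sizes are assumptions, not O2DETR's exact design.
import torch
import torch.nn as nn

class DepthwiseSeparableEncoderBlock(nn.Module):
    def __init__(self, dim=256, kernel=3):
        super().__init__()
        self.depthwise = nn.Conv2d(dim, dim, kernel, padding=kernel // 2,
                                   groups=dim)   # one filter per channel
        self.pointwise = nn.Conv2d(dim, dim, 1)  # mix channels with 1x1 conv
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                        # x: [B, C, H, W] features
        return self.norm(x + self.act(self.pointwise(self.depthwise(x))))

# Usage on a backbone feature map
out = DepthwiseSeparableEncoderBlock()(torch.randn(2, 256, 32, 32))
```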
arXiv Detail & Related papers (2021-06-06T14:57:17Z)
- Efficient DETR: Improving End-to-End Object Detector with Dense Prior [7.348184873564071]
We propose Efficient DETR, a simple and efficient pipeline for end-to-end object detection.
By taking advantage of both dense detection and sparse set detection, Efficient DETR leverages a dense prior to initialize the object containers.
Experiments conducted on MS COCO show that our method, with only 3 encoder layers and 1 decoder layer, achieves competitive performance with state-of-the-art object detection methods.
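One way to read "dense prior initialization" is: score every location of the encoder feature map with a dense head, then take the top-k features and positions as initial object queries and reference points. The sketch below illustrates that selection step under assumed shapes and an assumed 1x1-conv objectness head; it is not Efficient DETR's exact implementation.

```python
# Sketch of dense-prior initialization: score dense feature locations and
# use the top-k features/positions as initial object queries and reference
# points. An illustration only, not Efficient DETR's exact implementation.
import torch
import torch.nn as nn

def init_queries_from_dense_prior(feat, objectness_head, k=100):
    """feat: [B, C, H, W] encoder features. Returns queries [B, k, C] and
    normalized reference points [B, k, 2]."""
    B, C, H, W = feat.shape
    scores = objectness_head(feat).flatten(1)        # [B, H*W] dense scores
    topk = scores.topk(k, dim=1).indices             # [B, k] best locations
    flat = feat.flatten(2).transpose(1, 2)           # [B, H*W, C]
    queries = flat.gather(1, topk.unsqueeze(-1).expand(-1, -1, C))
    ys, xs = topk // W, topk % W                     # grid coordinates
    ref = torch.stack([(xs + 0.5) / W, (ys + 0.5) / H], dim=-1)
    return queries, ref

# Usage with a 1x1-conv objectness head (an assumption)
head = nn.Conv2d(256, 1, 1)
q, r = init_queries_from_dense_prior(torch.randn(2, 256, 32, 32), head)
```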
arXiv Detail & Related papers (2021-04-03T06:14:24Z)
- Learning Spatio-Temporal Transformer for Visual Tracking [108.11680070733598]
We present a new tracking architecture with an encoder-decoder transformer as the key component.
The whole method is end-to-end and does not need any postprocessing steps such as cosine windowing or bounding box smoothing.
The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks while running at real-time speed, 6x faster than Siam R-CNN.
arXiv Detail & Related papers (2021-03-31T15:19:19Z)
- UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [11.251593386108189]
We propose a novel pretext task named random query patch detection in Unsupervised Pre-training DETR (UP-DETR).
Specifically, we randomly crop patches from the given image and then feed them as queries to the decoder.
UP-DETR significantly boosts the performance of DETR with faster convergence and higher average precision on object detection, one-shot detection and panoptic segmentation.
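A hedged sketch of the pretext task: crop random patches from the image, embed each patch, add the embedding to a group of object queries, and supervise the decoder to localize the patch. The patch encoder, shapes, and query grouping below are assumptions for illustration, not UP-DETR's exact pipeline.

```python
# Sketch of the random-query-patch pretext task: crop random patches,
# embed them, and add the embeddings to object queries so the decoder
# must localize each patch. Encoder and grouping are assumptions.
import random
import torch
import torch.nn as nn

def make_patch_queries(img, query_embed, patch_encoder, num_patches=10):
    """img: [3, H, W]; query_embed: [Q, D], Q divisible by num_patches.
    Returns queries [Q, D] and target boxes [num_patches, 4] (cx, cy, w, h)."""
    _, H, W = img.shape
    Q, _ = query_embed.shape
    feats, boxes = [], []
    for _ in range(num_patches):
        ph, pw = random.randint(16, H // 2), random.randint(16, W // 2)
        y, x = random.randint(0, H - ph), random.randint(0, W - pw)
        feats.append(patch_encoder(img[:, y:y + ph, x:x + pw]))  # [D]
        boxes.append(torch.tensor([(x + pw / 2) / W, (y + ph / 2) / H,
                                   pw / W, ph / H]))
    # Each patch embedding is shared by a group of Q // num_patches queries.
    patch_feat = torch.stack(feats).repeat_interleave(Q // num_patches, 0)
    return query_embed + patch_feat, torch.stack(boxes)

# Stand-in patch encoder: global average pool + linear projection (assumed)
proj = nn.Linear(3, 256)
queries, targets = make_patch_queries(torch.rand(3, 256, 256),
                                      torch.zeros(100, 256),
                                      lambda p: proj(p.mean(dim=(1, 2))))
```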
arXiv Detail & Related papers (2020-11-18T05:16:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.