Toward Transformer-Based Object Detection
- URL: http://arxiv.org/abs/2012.09958v1
- Date: Thu, 17 Dec 2020 22:33:14 GMT
- Title: Toward Transformer-Based Object Detection
- Authors: Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, Dmitry
Kislyuk
- Abstract summary: Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
- Score: 12.704056181392415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have become the dominant model in natural language processing,
owing to their ability to pretrain on massive amounts of data, then transfer to
smaller, more specific tasks via fine-tuning. The Vision Transformer was the
first major attempt to apply a pure transformer model directly to images as
input, demonstrating that as compared to convolutional networks,
transformer-based architectures can achieve competitive results on benchmark
classification tasks. However, the computational complexity of the attention
operator means that we are limited to low-resolution inputs. For more complex
tasks such as detection or segmentation, maintaining a high input resolution is
crucial to ensure that models can properly identify and reflect fine details in
their output. This naturally raises the question of whether or not
transformer-based architectures such as the Vision Transformer are capable of
performing tasks other than classification. In this paper, we determine that
Vision Transformers can be used as a backbone by a common detection task head
to produce competitive COCO results. The model that we propose, ViT-FRCNN,
demonstrates several known properties associated with transformers, including
large pretraining capacity and fast fine-tuning performance. We also
investigate improvements over a standard detection backbone, including superior
performance on out-of-domain images, better performance on large objects, and a
lessened reliance on non-maximum suppression. We view ViT-FRCNN as an important
stepping stone toward a pure-transformer solution of complex vision tasks such
as object detection.
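To make the backbone-plus-detection-head idea from the abstract concrete, below is a minimal, hypothetical sketch (not the authors' code): a ViT-style encoder turns image patches into tokens, the output patch tokens are reshaped back into a 2D feature map, and that map is fed to a standard Faster R-CNN head. The toy encoder (illustrative sizes, no positional embeddings) is an assumption; the torchvision wiring follows its documented custom-backbone pattern and may differ from the paper's actual configuration.

```python
import torch
from torch import nn
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign


class ToyViTBackbone(nn.Module):
    """ViT-style backbone: patch embedding -> transformer encoder -> 2D feature map."""

    def __init__(self, embed_dim=256, patch_size=16, depth=4, num_heads=8):
        super().__init__()
        # torchvision's FasterRCNN reads this attribute to size its heads.
        self.out_channels = embed_dim
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim, batch_first=True)
        # A real ViT also adds positional embeddings; omitted here for brevity.
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        feats = self.patch_embed(x)                   # (B, C, H/16, W/16)
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)     # (B, num_patches, C)
        tokens = self.encoder(tokens)                 # global self-attention over patches
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # tokens back to a spatial map


backbone = ToyViTBackbone()
# Single feature level, so one tuple of anchor sizes / aspect ratios.
anchor_gen = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                             aspect_ratios=((0.5, 1.0, 2.0),))
roi_pool = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)
model = FasterRCNN(backbone, num_classes=91,          # 91 = COCO category IDs incl. background
                   rpn_anchor_generator=anchor_gen, box_roi_pool=roi_pool)

model.eval()
with torch.no_grad():
    # Inference returns one dict per image with 'boxes', 'labels', 'scores'.
    predictions = model([torch.rand(3, 512, 512)])
```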
Related papers
- An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z)
- CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector [15.656374849760734]
We present a novel object detection architecture, named Convolutional vision Transformer Based Attentive Single Shot MultiBox Detector (CvT-ASSD).
Our model, CvT-ASSD, can lead to good system efficiency and performance when pretrained on large-scale detection datasets such as PASCAL VOC and MS COCO.
arXiv Detail & Related papers (2021-10-24T06:45:33Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architecture for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
- Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer named the Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, an abbreviation of 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling of long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)