Visual Transformer for Object Detection
- URL: http://arxiv.org/abs/2206.06323v1
- Date: Wed, 1 Jun 2022 06:13:09 GMT
- Title: Visual Transformer for Object Detection
- Authors: Michael Yang
- Abstract summary: We consider the use of self-attention for a discriminative visual task, object detection, as an alternative to convolutions.
Our model leads to consistent improvements in object detection on COCO across many different models and scales.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional Neural Networks (CNNs) have been the first-choice paradigm in
many computer vision applications. The convolution operation, however, has a
significant weakness: it operates only on a local neighborhood of pixels and
therefore misses the global context of the surrounding regions. Transformers,
or self-attention networks more specifically, have emerged as a recent advance
for capturing long-range interactions in the input, but they have mostly been
applied to sequence modeling tasks such as Neural Machine Translation, image
captioning, and other Natural Language Processing tasks, where they have
achieved promising results. Their applications to visual tasks, however, remain
far from satisfactory. Considering the weaknesses of both Convolutional Neural
Networks and Transformers, in this paper we consider the use of self-attention
for a discriminative visual task, object detection, as an alternative to
convolutions, and propose our model: DetTransNet. Extensive experiments show
that our model leads to consistent improvements in object detection on COCO
across many different models and scales, including ResNets, while keeping the
number of parameters similar. In particular, our method achieves a 1.2% Average
Precision improvement on the COCO object detection task over other baseline
models.
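The abstract contrasts convolution's local receptive field with self-attention's global one but gives no implementation details. As a minimal, illustrative sketch of the mechanism the paper builds on (not DetTransNet itself; all names and shapes here are hypothetical), a single self-attention head over a flattened grid of pixel features can be written as:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over a sequence of feature vectors.

    Unlike a convolution, every output position attends to every input
    position, so the receptive field is global rather than local.
    x:          (n, d)  n positions (e.g. flattened H*W pixels), d channels
    wq, wk, wv: (d, d)  query/key/value projection matrices
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (n, n) pairwise interactions
    return softmax(scores, axis=-1) @ v       # (n, d) weighted mixture

# Example: 16 "pixel" positions with 8 channels each.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (16, 8)
```

Note the (n, n) score matrix: this is what gives self-attention its global view, and also its quadratic cost in the number of positions, which is why visual transformers typically attend over patches rather than raw pixels.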
Related papers
- A Simple yet Effective Network based on Vision Transformer for Camouflaged Object and Salient Object Detection [33.30644598646274]
We propose a simple yet effective network (SENet) based on vision Transformer (ViT).
To enhance the Transformer's ability to model local information, we propose a local information capture module (LICM).
We also propose a dynamic weighted loss (DW loss) based on Binary Cross-Entropy (BCE) and Intersection over Union (IoU) loss, which guides the network to pay more attention to those smaller and more difficult-to-find target objects.
arXiv Detail & Related papers (2024-02-29T07:29:28Z)
- Analyzing Local Representations of Self-supervised Vision Transformers [34.56680159632432]
We present a comparative analysis of various self-supervised Vision Transformers (ViTs).
Inspired by large language models, we examine the abilities of ViTs to perform various computer vision tasks with little to no fine-tuning.
arXiv Detail & Related papers (2023-12-31T11:38:50Z)
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network, that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method: Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in the low illumination indoor scene.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
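Several entries above (Visual Saliency Transformer, ViT-FRCNN) share the ViT recipe of turning an image into a sequence of patch tokens before applying transformer layers. As an illustrative sketch of that patchification step (shapes and names are hypothetical, not taken from any of the listed papers):

```python
import numpy as np

def image_to_patches(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patch tokens.

    Returns an (n_patches, p*p*C) array: the flat token sequence that a
    ViT-style backbone feeds to its transformer layers. H and W are
    assumed divisible by p.
    """
    h, w, c = img.shape
    patches = img.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (h//p, w//p, p, p, c)
    return patches.reshape(-1, p * p * c)        # (n_patches, p*p*c)

img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
tokens = image_to_patches(img, 8)
print(tokens.shape)  # (16, 192): a 4x4 grid of 8x8x3 patches
```

Attending over these 16 tokens instead of 1024 raw pixels is what keeps the quadratic cost of self-attention tractable at image resolution.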
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.