Glance-and-Gaze Vision Transformer
- URL: http://arxiv.org/abs/2106.02277v1
- Date: Fri, 4 Jun 2021 06:13:47 GMT
- Title: Glance-and-Gaze Vision Transformer
- Authors: Qihang Yu, Yingda Xia, Yutong Bai, Yongyi Lu, Alan Yuille, Wei Shen
- Abstract summary: We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance-and-Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate that our method achieves consistently superior performance over previous state-of-the-art Transformers.
- Score: 13.77016463781053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, a series of vision Transformers has emerged, showing
superior performance with more compact model sizes than conventional
convolutional neural networks, thanks to the strong ability of Transformers to
model long-range dependencies. However, the advantages of vision Transformers
come at a price: self-attention, the core component of the Transformer, has
quadratic complexity in the input sequence length. This leads to a dramatic
increase in computation and memory cost as the sequence length grows, making
it difficult to apply Transformers to vision tasks that require dense
predictions on high-resolution feature maps. In this paper, we propose a new
vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer), to
address these issues. It is motivated by the Glance-and-Gaze behavior of human
beings when recognizing objects in natural scenes, and it efficiently models
both long-range dependencies and local context. In GG-Transformer, this
behavior is realized by two parallel branches: the Glance branch performs
self-attention on adaptively-dilated partitions of the input, which yields
linear complexity while retaining a global receptive field; the Gaze branch is
a simple depth-wise convolutional layer that restores local image context to
the features produced by the Glance mechanism. We empirically demonstrate that
our method achieves consistently superior performance over previous
state-of-the-art Transformers on various vision tasks and benchmarks. Code and
models will be made available at https://github.com/yucornetto/GG-Transformer.
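The abstract gives enough detail to sketch how the two branches could fit together. The PyTorch snippet below is a minimal, hedged illustration based only on the description above: the partition size, the way dilated partitions are gathered and scattered, the 3x3 depth-wise kernel, and the fusion by simple addition are all assumptions for illustration, not the authors' exact design (see the official repository for the real implementation).

```python
import torch
import torch.nn as nn


class GlanceGazeBlock(nn.Module):
    """Minimal sketch of a Glance-and-Gaze block, based only on the abstract.

    Glance: self-attention over adaptively-dilated partitions of the token
    grid, so every partition samples the whole feature map with a fixed token
    budget (linear complexity, global receptive field).
    Gaze: a depth-wise convolution that restores local image context.
    Partition size, head count, and how the branches are fused are assumptions.
    """

    def __init__(self, dim, num_heads=4, partition=7):
        super().__init__()
        self.partition = partition
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gaze branch: depth-wise 3x3 convolution over the 2D feature map.
        self.gaze = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, C, H, W); H and W are assumed divisible by the partition size.
        B, C, H, W = x.shape
        p = self.partition
        dh, dw = H // p, W // p  # dilation rates adapted to the input size

        # Glance: gather dilated partitions, then attend within each one.
        # Tokens in a partition are spaced (dh, dw) apart, so each sequence of
        # p*p tokens covers the whole map, giving a global receptive field.
        g = x.reshape(B, C, p, dh, p, dw).permute(0, 3, 5, 2, 4, 1)
        g = g.reshape(B * dh * dw, p * p, C)
        g = self.norm(g)
        g, _ = self.attn(g, g, g, need_weights=False)
        g = g.reshape(B, dh, dw, p, p, C).permute(0, 5, 3, 1, 4, 2)
        g = g.reshape(B, C, H, W)

        # Gaze: depth-wise convolution adds back local context.
        out = g + self.gaze(x)

        # Shared output projection, then a residual connection (assumed).
        out = self.proj(out.flatten(2).transpose(1, 2)).transpose(1, 2)
        return out.reshape(B, C, H, W) + x
```

For instance, `GlanceGazeBlock(dim=64)` applied to a `(2, 64, 28, 28)` tensor returns a tensor of the same shape; with a 7x7 partition, the 28x28 map is split into 16 dilated partitions of 49 tokens each, and every partition spans the full feature map, which is how the sketch keeps attention cost linear in the number of tokens.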
Related papers
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions (a minimal sketch follows this entry).
The iTransformer model achieves state-of-the-art results on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
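As referenced above, here is a minimal sketch of the "inverted" idea, based only on the one-line summary: each variate's whole lookback window is embedded as a single token, and attention mixes information across variates rather than across time steps. The layer sizes, the plain `nn.TransformerEncoder`, and the linear forecasting head are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class InvertedForecaster(nn.Module):
    """Sketch of attention on inverted dimensions for multivariate series."""

    def __init__(self, lookback, horizon, d_model=64, num_heads=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(lookback, d_model)   # whole series -> one token per variate
        layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, horizon)     # variate token -> its forecast

    def forward(self, x):
        # x: (batch, lookback, num_variates); invert so variates become tokens.
        tokens = self.embed(x.transpose(1, 2))      # (batch, num_variates, d_model)
        tokens = self.encoder(tokens)               # attention across variates
        return self.head(tokens).transpose(1, 2)    # (batch, horizon, num_variates)
```

For example, `InvertedForecaster(lookback=96, horizon=24)(torch.randn(8, 96, 7))` returns an `(8, 24, 7)` forecast, one 24-step prediction per variate.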
- Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively to baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in masked image modeling (MIM).
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks and evaluate them quantitatively on three vision-and-language tasks and six benchmark datasets.
Experimental results show that, while saving a large number of parameters and computations, LW-Transformer achieves highly competitive performance against the original Transformer networks on vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
- Gophormer: Ego-Graph Transformer for Node Classification [27.491500255498845]
In this paper, we propose a novel Gophormer model, which applies Transformers to ego-graphs instead of full graphs.
Specifically, a Node2Seq module is proposed to sample ego-graphs as the input to the Transformer, which alleviates the scalability challenge.
To handle the uncertainty introduced by ego-graph sampling, we propose a consistency regularization and a multi-sample inference strategy.
arXiv Detail & Related papers (2021-10-25T16:43:32Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases in their design and are naturally suited as set functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.