Global Context Vision Transformers
- URL: http://arxiv.org/abs/2206.09959v5
- Date: Tue, 6 Jun 2023 08:17:18 GMT
- Title: Global Context Vision Transformers
- Authors: Ali Hatamizadeh, Hongxu Yin, Greg Heinrich, Jan Kautz, and Pavlo
Molchanov
- Abstract summary: We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
- Score: 78.5346173956383
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose global context vision transformer (GC ViT), a novel architecture
that enhances parameter and compute utilization for computer vision. Our method
leverages global context self-attention modules, jointly with standard local
self-attention, to effectively and efficiently model both long and short-range
spatial interactions, without the need for expensive operations such as
computing attention masks or shifting local windows. In addition, we address
the lack of inductive bias in ViTs and propose to leverage modified fused
inverted residual blocks in our architecture. Our proposed GC ViT
achieves state-of-the-art results across image classification, object detection
and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the
variants of GC ViT with 51M, 90M and 201M parameters achieve 84.3%, 85.0% and
85.7% Top-1 accuracy, respectively, at 224x224 image resolution and without any
pre-training, hence surpassing comparably-sized prior art such as the CNN-based
ConvNeXt and the ViT-based MaxViT and Swin Transformer by a large margin.
Pre-trained GC ViT backbones consistently outperform prior work on the downstream
tasks of object detection, instance segmentation, and semantic segmentation using
the MS COCO and ADE20K datasets. Specifically, GC ViT with a 4-scale DINO
detection head achieves a box AP of 58.3 on the MS COCO dataset.
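The abstract describes two attention paths: standard local self-attention within windows for short-range interactions, and a global self-attention whose queries are derived from the entire feature map and shared across windows for long-range interactions. The PyTorch sketch below illustrates that split at a high level; the pooling-based global query generator, the shapes, and the omission of the fused-MBConv-style downsampling blocks are simplifying assumptions, not the authors' implementation.

```python
# A minimal sketch of local window attention plus global-query attention.
# Illustrative only; not GC ViT's reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def window_partition(x, ws):
    """(B, H, W, C) -> (B * num_windows, ws * ws, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


class WindowAttention(nn.Module):
    """Multi-head attention over window tokens.

    With global_query=True the queries come from a shared set of image-level
    tokens, so every window is attended with the same global queries."""

    def __init__(self, dim, num_heads, global_query=False):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.global_query = global_query
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, q_tokens=None):
        # x: (B*nW, N, C); q_tokens: (B*nW, N, C) when global_query=True
        Bn, N, C = x.shape
        q_in = q_tokens if self.global_query else x
        q = self.q(q_in).view(Bn, -1, self.num_heads, self.head_dim).transpose(1, 2)
        kv = self.kv(x).view(Bn, N, 2, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(Bn, -1, C)
        return self.proj(out)


# usage: a 14x14 feature map, 7x7 windows, global queries pooled to window size
B, H, W, C, ws, heads = 2, 14, 14, 64, 7, 4
x = torch.randn(B, H, W, C)
windows = window_partition(x, ws)                          # (B*nW, 49, C)
g = F.adaptive_avg_pool2d(x.permute(0, 3, 1, 2), ws)       # (B, C, 7, 7)
g = g.flatten(2).transpose(1, 2)                           # (B, 49, C)
g = g.repeat_interleave(windows.shape[0] // B, dim=0)      # (B*nW, 49, C), same order as windows

local_attn = WindowAttention(C, heads)                     # short-range interactions
global_attn = WindowAttention(C, heads, global_query=True)  # long-range interactions
y = local_attn(windows) + global_attn(windows, q_tokens=g)  # (B*nW, 49, C)
```

In a real backbone the local and global attention would sit in separate, alternating blocks within each stage; the sum at the end here only exercises both modules in a single example.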
Related papers
- Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets [11.95214938154427]
Vision Transformer (ViT) captures global information by dividing images into patches.
ViT lacks inductive bias when trained on image or video datasets.
We present a lightweight Depth-Wise Convolution module as a shortcut in ViT models.
arXiv Detail & Related papers (2024-07-28T04:23:40Z)
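The entry above adds convolutional inductive bias to ViT through a lightweight depth-wise convolution shortcut. The sketch below shows one generic way to wire such a shortcut into a pre-norm ViT block; the kernel size, the placement alongside the attention branch, and the shared residual are assumptions for illustration, not that paper's exact module.

```python
# A generic depth-wise convolution shortcut inside a ViT block (illustrative).
import torch
import torch.nn as nn


class DWConvShortcut(nn.Module):
    """3x3 depth-wise convolution applied over the token grid."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)

    def forward(self, tokens, hw):
        # tokens: (B, N, C) with N == H * W
        B, N, C = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, C, H, W)
        return self.dw(x).flatten(2).transpose(1, 2)


class ViTBlockWithDWShortcut(nn.Module):
    """Pre-norm ViT block with an extra depth-wise convolution shortcut."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.dw_shortcut = DWConvShortcut(dim)

    def forward(self, x, hw):
        y = self.norm1(x)
        # attention branch and convolutional shortcut share one residual connection
        x = x + self.attn(y, y, y, need_weights=False)[0] + self.dw_shortcut(y, hw)
        x = x + self.mlp(self.norm2(x))
        return x


# usage: a 14x14 token grid, i.e. 196 tokens of width 192
blk = ViTBlockWithDWShortcut(dim=192, num_heads=4)
out = blk(torch.randn(2, 196, 192), hw=(14, 14))   # (2, 196, 192)
```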
- DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition [62.95223898214866]
We explore effective Vision Transformers to pursue a better trade-off between computational complexity and the size of the attended receptive field.
With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages.
Our experiment results show that our DilateFormer achieves state-of-the-art performance on various vision tasks.
arXiv Detail & Related papers (2023-02-03T14:59:31Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With fewer than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
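The "query-irrelevant global context" in the entry above can be read as a single attention distribution over all spatial positions, shared by every query, that pools the feature map into one context vector, which is then folded back into a convolutional branch. The sketch below follows that reading; the 1x1-conv scoring, the transform, and the depth-wise local branch are assumptions, not FCViT's published module.

```python
# A sketch of a query-irrelevant global context merged into a conv branch (illustrative).
import torch
import torch.nn as nn


class QueryIrrelevantGlobalContext(nn.Module):
    """One softmax attention map over all positions pools a single context vector."""

    def __init__(self, dim):
        super().__init__()
        self.attn_logits = nn.Conv2d(dim, 1, kernel_size=1)   # one score per position
        self.transform = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1), nn.GELU(), nn.Conv2d(dim, dim, kernel_size=1)
        )
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):
        # x: (B, C, H, W)
        B, C, H, W = x.shape
        w = self.attn_logits(x).flatten(2).softmax(dim=-1)        # (B, 1, H*W), no per-query maps
        context = torch.einsum("bnl,bcl->bcn", w, x.flatten(2))   # (B, C, 1) global context
        context = self.transform(context.view(B, C, 1, 1))        # (B, C, 1, 1)
        # fold the global context into a depth-wise convolutional branch
        return self.local(x) + context


m = QueryIrrelevantGlobalContext(dim=64)
y = m(torch.randn(2, 64, 28, 28))   # (2, 64, 28, 28)
```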
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that, by learning global context over the full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Lightweight Vision Transformer with Cross Feature Attention [6.103065659061625]
Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations.
ViTs can learn global representations with their self-attention mechanism, but they are usually heavyweight and unsuitable for mobile devices.
We propose cross feature attention (XFA) to bring down the cost of transformers, and combine it with efficient mobile CNNs to form a novel lightweight CNN-ViT hybrid model, XFormer.
arXiv Detail & Related papers (2022-07-15T03:27:13Z)
- SepViT: Separable Vision Transformer [20.403430632658946]
Vision Transformers often rely on extensive computational costs to achieve high performance, which is burdensome to deploy on resource-constrained devices.
We draw lessons from depthwise separable convolution and follow its design philosophy to build an efficient Transformer backbone, i.e., the Separable Vision Transformer, abbreviated as SepViT.
SepViT carries out local-global information interaction within and among the windows in sequential order via a depthwise separable self-attention.
arXiv Detail & Related papers (2022-03-29T09:20:01Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
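A cross-shaped window can be pictured as the union of a horizontal and a vertical stripe through each token: some heads attend within horizontal stripes, the rest within vertical stripes, so the field of interaction grows beyond a square local window without the cost of full global self-attention. The sketch below illustrates that stripe split; the stripe width, the even channel/head split, and the use of nn.MultiheadAttention are simplifying assumptions rather than the CSWin reference implementation.

```python
# A rough sketch of stripe-based "cross-shaped" window attention (illustrative).
import torch
import torch.nn as nn


def stripe_attention(x, attn, stripe_h, stripe_w):
    """Self-attention within non-overlapping stripes of size stripe_h x stripe_w."""
    B, H, W, C = x.shape
    x = x.view(B, H // stripe_h, stripe_h, W // stripe_w, stripe_w, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, stripe_h * stripe_w, C)
    x, _ = attn(x, x, x, need_weights=False)
    x = x.view(B, H // stripe_h, W // stripe_w, stripe_h, stripe_w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


class CrossShapedAttention(nn.Module):
    """Half of the channels/heads attend in horizontal stripes, the other half in vertical ones."""

    def __init__(self, dim, num_heads, stripe=2):
        super().__init__()
        half = dim // 2
        self.stripe = stripe
        self.h_attn = nn.MultiheadAttention(half, num_heads // 2, batch_first=True)
        self.v_attn = nn.MultiheadAttention(half, num_heads // 2, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C); split channels between horizontal and vertical stripes
        B, H, W, C = x.shape
        xh, xv = x.chunk(2, dim=-1)
        xh = stripe_attention(xh, self.h_attn, stripe_h=self.stripe, stripe_w=W)
        xv = stripe_attention(xv, self.v_attn, stripe_h=H, stripe_w=self.stripe)
        return self.proj(torch.cat([xh, xv], dim=-1))


m = CrossShapedAttention(dim=64, num_heads=4, stripe=2)
y = m(torch.randn(2, 8, 8, 64))   # (2, 8, 8, 64)
```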
- Focal Self-attention for Local-Global Interactions in Vision Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
arXiv Detail & Related papers (2021-07-01T17:56:09Z)