Focal Self-attention for Local-Global Interactions in Vision
Transformers
- URL: http://arxiv.org/abs/2107.00641v1
- Date: Thu, 1 Jul 2021 17:56:09 GMT
- Title: Focal Self-attention for Local-Global Interactions in Vision
Transformers
- Authors: Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu
Yuan, Jianfeng Gao
- Abstract summary: We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
- Score: 90.9169644436091
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Vision Transformer and its variants have shown great promise on
various computer vision tasks. The ability to capture short- and long-range
visual dependencies through self-attention is arguably the main source of this
success, but it also brings challenges due to the quadratic computational
overhead, especially for high-resolution vision tasks (e.g., object detection).
In this paper, we present focal self-attention, a new mechanism that
incorporates both fine-grained local and coarse-grained global interactions.
With this mechanism, each token attends to its closest surrounding tokens at
fine granularity and to tokens far away at coarse granularity, and can thus
capture both short- and long-range visual dependencies efficiently and
effectively.
With focal self-attention, we propose a new variant of Vision Transformer
models, called Focal Transformer, which achieves superior performance over the
state-of-the-art vision Transformers on a range of public image classification
and object detection benchmarks. In particular, our Focal Transformer models
with a moderate size of 51.1M and a larger size of 89.8M achieve 83.5 and 83.8
Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution.
Using Focal Transformers as the backbones, we obtain consistent and substantial
improvements over the current state-of-the-art Swin Transformers for 6
different object detection methods trained with standard 1x and 3x schedules.
Our largest Focal Transformer yields 58.7/58.9 box mAPs and 50.9/51.3 mask mAPs
on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation,
creating new SoTA on three of the most challenging computer vision tasks.
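As an illustration of the mechanism described in the abstract, the sketch below shows the core idea in PyTorch: each local window of query tokens attends to its own fine-grained tokens plus a set of coarse-grained tokens obtained by pooling. This is a simplified sketch, not the paper's implementation: it uses a single coarse level that summarizes the whole feature map rather than a banded surrounding region, a single head, and no learned query/key/value projections or relative position bias; the function name `focal_self_attention` and its parameters are illustrative.

```python
# Minimal, illustrative sketch of focal self-attention (not the authors'
# reference code). Assumptions: one coarse focal level, one head, square
# feature maps, no learned projections or relative position bias.
import torch
import torch.nn.functional as F


def focal_self_attention(x, window_size=4, coarse_pool=4):
    """x: (B, H, W, C) feature map.

    Each fine window of window_size x window_size tokens attends to
    (a) its own fine-grained tokens and (b) coarse-grained tokens obtained
    by average-pooling the map with stride `coarse_pool`.
    """
    B, H, W, C = x.shape
    scale = C ** -0.5

    # Coarse-grained tokens: pool the full map (shared by all windows here;
    # the paper restricts them to a surrounding region per window).
    coarse = F.avg_pool2d(x.permute(0, 3, 1, 2), coarse_pool)   # (B, C, H/p, W/p)
    coarse = coarse.flatten(2).transpose(1, 2)                  # (B, Ng, C)

    # Fine-grained tokens: partition the map into non-overlapping windows.
    nH, nW = H // window_size, W // window_size
    fine = x.view(B, nH, window_size, nW, window_size, C)
    fine = fine.permute(0, 1, 3, 2, 4, 5).reshape(B, nH * nW, window_size**2, C)

    # Queries come from fine tokens; keys/values from fine + coarse tokens.
    q = fine                                                     # (B, nWin, ws^2, C)
    kv_coarse = coarse.unsqueeze(1).expand(-1, nH * nW, -1, -1)  # broadcast coarse tokens
    kv = torch.cat([fine, kv_coarse], dim=2)                     # (B, nWin, ws^2 + Ng, C)

    attn = (q @ kv.transpose(-2, -1)) * scale
    attn = attn.softmax(dim=-1)
    out = attn @ kv                                              # (B, nWin, ws^2, C)

    # Merge windows back into a (B, H, W, C) map.
    out = out.view(B, nH, nW, window_size, window_size, C)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
    return out


if __name__ == "__main__":
    feat = torch.randn(2, 16, 16, 32)          # toy feature map
    print(focal_self_attention(feat).shape)    # torch.Size([2, 16, 16, 32])
```

Because each window attends to a fixed number of local tokens plus a fixed number of pooled global tokens, the attention cost grows linearly with the number of windows rather than quadratically with the number of tokens, which is the efficiency argument the abstract makes.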
Related papers
- Attention Deficit is Ordered! Fooling Deformable Vision Transformers
with Collaborative Adversarial Patches [3.4673556247932225]
Deformable vision transformers significantly reduce the complexity of attention modeling.
Recent work has demonstrated adversarial attacks against conventional vision transformers.
We develop new collaborative attacks where a source patch manipulates attention to point to a target patch, which contains the adversarial noise to fool the model.
arXiv Detail & Related papers (2023-11-21T17:55:46Z) - DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present the Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z) - Vicinity Vision Transformer [53.43198716947792]
We present Vicinity Attention, which introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - Global Context Vision Transformers [78.5346173956383]
We propose the Global Context Vision Transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision.
We address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture.
Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks.
arXiv Detail & Related papers (2022-06-20T18:42:44Z) - HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z) - Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped
Attention [28.44439386445018]
We propose Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region.
Compared to global self-attention, PS-Attention can reduce the computation and memory costs significantly.
We develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with model sizes of 22M, 48M, and 85M parameters, respectively.
arXiv Detail & Related papers (2021-12-28T05:37:24Z) - CSWin Transformer: A General Vision Transformer Backbone with
Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z) - Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [44.086393272557416]
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.
It surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.
arXiv Detail & Related papers (2021-03-25T17:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.