ACC-ViT : Atrous Convolution's Comeback in Vision Transformers
- URL: http://arxiv.org/abs/2403.04200v1
- Date: Thu, 7 Mar 2024 04:05:16 GMT
- Title: ACC-ViT : Atrous Convolution's Comeback in Vision Transformers
- Authors: Nabil Ibtehaz, Ning Yan, Masood Mortazavi, Daisuke Kihara
- Abstract summary: We introduce Atrous Attention, a fusion of regional and sparse attention, which can adaptively consolidate both local and global information.
We also propose a general vision transformer backbone, named ACC-ViT, following conventional practices for standard vision tasks.
ACC-ViT is therefore a strong vision backbone, which is also competitive in mobile-scale versions, ideal for niche applications with small datasets.
- Score: 5.224344210588584
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformers have risen to become the state-of-the-art vision architectures through innovations in attention mechanisms inspired by visual perception. At present, two classes of attention prevail in vision transformers: regional and sparse attention. The former bounds pixel interactions within a region; the latter spreads them across sparse grids. Their opposing natures have resulted in a dilemma between preserving hierarchical relations and attaining a global context. In this work, taking inspiration from atrous convolution, we introduce Atrous Attention, a fusion of regional and sparse attention that can adaptively consolidate both local and global information while maintaining hierarchical relations. As a further tribute to atrous convolution, we redesign the ubiquitous inverted residual convolution blocks with atrous convolution. Finally, we propose a generalized, hybrid vision transformer backbone, named ACC-ViT, following conventional practices for standard vision tasks. Our tiny model achieves approximately 84% accuracy on ImageNet-1K with fewer than 28.5 million parameters, a 0.42% improvement over the state-of-the-art MaxViT with 8.4% fewer parameters. In addition, we have investigated the efficacy of the ACC-ViT backbone under different evaluation settings, such as finetuning, linear probing, and zero-shot learning, on tasks involving medical image analysis, object detection, and language-image contrastive learning. ACC-ViT is therefore a strong vision backbone that is also competitive in mobile-scale versions, making it ideal for niche applications with small datasets.
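The abstract describes Atrous Attention as gathering tokens with atrous-convolution-style dilated sampling, attending within each dilated window, and adaptively fusing local and global information. The PyTorch sketch below is a minimal, hedged interpretation of that idea, not the authors' implementation: the names (AtrousAttentionSketch, atrous_sample, atrous_merge) and the gating head are illustrative assumptions, and plain full attention is used inside each dilated sub-grid for brevity.

```python
# Minimal sketch of dilated ("atrous") attention with adaptive branch fusion.
# Illustrative assumption only, not the ACC-ViT authors' implementation.
import torch
import torch.nn as nn


def atrous_sample(x, rate):
    """Split a (B, C, H, W) map into rate*rate interleaved sub-grids,
    mimicking the strided sampling pattern of atrous convolution."""
    B, C, H, W = x.shape
    x = x.view(B, C, H // rate, rate, W // rate, rate)
    # Fold the (rate x rate) phase offsets into the batch dimension.
    return x.permute(0, 3, 5, 1, 2, 4).reshape(B * rate * rate, C, H // rate, W // rate)


def atrous_merge(x, rate, batch):
    """Inverse of atrous_sample: scatter the sub-grids back to full resolution."""
    _, C, h, w = x.shape
    x = x.view(batch, rate, rate, C, h, w).permute(0, 3, 4, 1, 5, 2)
    return x.reshape(batch, C, h * rate, w * rate)


class AtrousAttentionSketch(nn.Module):
    """Runs attention at several dilation rates and fuses the branches
    with an input-dependent gate (adaptive local/global consolidation)."""

    def __init__(self, dim, num_heads=4, dilations=(1, 2, 4)):
        super().__init__()
        self.dilations = dilations
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(                  # per-image weights over the branches
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(dim, len(dilations)), nn.Softmax(dim=-1),
        )

    def forward(self, x):                           # x: (B, C, H, W); H, W divisible by max dilation
        B, C, H, W = x.shape
        weights = self.gate(x)                      # (B, len(dilations))
        out = torch.zeros_like(x)
        for i, r in enumerate(self.dilations):
            xs = atrous_sample(x, r)                # (B*r*r, C, H/r, W/r)
            tokens = xs.flatten(2).transpose(1, 2)  # (B*r*r, H*W/r^2, C)
            attended, _ = self.attn(tokens, tokens, tokens)
            attended = attended.transpose(1, 2).reshape(xs.shape)
            out = out + weights[:, i].view(B, 1, 1, 1) * atrous_merge(attended, r, B)
        return out


if __name__ == "__main__":
    feats = torch.randn(2, 64, 16, 16)
    print(AtrousAttentionSketch(dim=64)(feats).shape)  # torch.Size([2, 64, 16, 16])
```

With dilation rate 1 the branch behaves like dense regional attention, while larger rates reach across the whole feature map in the manner of sparse grid attention; the learned gate decides per image how to weight these local and global views.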
Related papers
- Fusion of regional and sparse attention in Vision Transformers [4.782322901897837]
Modern vision transformers leverage visually inspired local interaction between pixels through attention computed within window or grid regions.
We propose Atrous Attention, a blend of regional and sparse attention that dynamically integrates both local and global information.
Our compact model achieves approximately 84% accuracy on ImageNet-1K with fewer than 28.5 million parameters, outperforming the state-of-the-art MaxViT by 0.42%.
arXiv Detail & Related papers (2024-06-13T06:48:25Z) - TiC: Exploring Vision Transformer in Convolution [37.50285921899263]
We propose the Multi-Head Self-Attention Convolution (MSA-Conv).
MSA-Conv incorporates self-attention within generalized convolutions, including standard, dilated, and depthwise ones.
We present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv.
arXiv Detail & Related papers (2023-10-06T10:16:26Z) - ACC-UNet: A Completely Convolutional UNet model for the 2020s [2.7013801448234367]
ACC-UNet is a completely convolutional UNet model that brings the best of both worlds: the inherent inductive biases of convnets and the design decisions of transformers.
ACC-UNet was evaluated on 5 different medical image segmentation benchmarks and consistently outperformed convnets, transformers, and their hybrids.
arXiv Detail & Related papers (2023-08-25T21:39:43Z) - Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model local and global information.
Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z) - A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate this global context into convolutions.
With fewer than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z) - Vicinity Vision Transformer [53.43198716947792]
We present Vicinity Attention, which introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for
Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Uniformer: Unified Transformer for Efficient Spatiotemporal
Representation Learning [68.55487598401788]
Recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers.
We propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of 3D convolution and self-attention in a concise transformer format.
We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2.
Our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods.
arXiv Detail & Related papers (2022-01-12T20:02:32Z) - Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)