Related papers: TiC: Exploring Vision Transformer in Convolution

TiC: Exploring Vision Transformer in Convolution

URL: http://arxiv.org/abs/2310.04134v2
Date: Mon, 27 May 2024 14:37:59 GMT
Title: TiC: Exploring Vision Transformer in Convolution
Authors: Song Zhang, Qingzhong Wang, Jiang Bian, Haoyi Xiong,
Abstract summary: We propose the Multi-Head Self-Attention Convolution (MSA-Conv) MSA-Conv incorporates Self-Attention within generalized convolutions, including standard, dilated, and depthwise ones. We present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv.
Score: 37.50285921899263
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While models derived from Vision Transformers (ViTs) have been phonemically surging, pre-trained models cannot seamlessly adapt to arbitrary resolution images without altering the architecture and configuration, such as sampling the positional encoding, limiting their flexibility for various vision tasks. For instance, the Segment Anything Model (SAM) based on ViT-Huge requires all input images to be resized to 1024$\times$1024. To overcome this limitation, we propose the Multi-Head Self-Attention Convolution (MSA-Conv) that incorporates Self-Attention within generalized convolutions, including standard, dilated, and depthwise ones. Enabling transformers to handle images of varying sizes without retraining or rescaling, the use of MSA-Conv further reduces computational costs compared to global attention in ViT, which grows costly as image size increases. Later, we present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv, where two capacity enhancing strategies, namely Multi-Directional Cyclic Shifted Mechanism and Inter-Pooling Mechanism, have been proposed, through establishing long-distance connections between tokens and enlarging the effective receptive field. Extensive experiments have been carried out to validate the overall effectiveness of TiC. Additionally, ablation studies confirm the performance improvement made by MSA-Conv and the two capacity enhancing strategies separately. Note that our proposal aims at studying an alternative to the global attention used in ViT, while MSA-Conv meets our goal by making TiC comparable to state-of-the-art on ImageNet-1K. Code will be released at https://github.com/zs670980918/MSA-Conv.

Related papers

CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [73.80247057590519]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability. We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. Our model achieves 83.0%/84.1% top-1 with only 12M/21M parameters on ImageNet-1K.
arXiv Detail & Related papers (2024-08-07T11:33:46Z)
CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation. We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales. We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) HiViT enjoys both high efficiency and good performance in MIM. In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$times$ speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE. ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context. We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations [61.95114821573875]
We introduce Plug-In Inversion, which relies on a simple set of augmentations and does not require excessive hyper- parameter tuning. We illustrate the practicality of our approach by inverting Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs) trained on the ImageNet dataset.
arXiv Detail & Related papers (2022-01-31T02:12:45Z)
Convolutional Xformers for Vision [2.7188347260210466]
Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reason for their limited use include their need for larger training datasets and more computational resources compared to convolutional neural networks (CNNs) We propose a linear attention-convolution hybrid architecture -- Convolutional X-formers for Vision (CXV) -- to overcome these limitations. We replace the quadratic attention with linear attention mechanisms, such as Performer, Nystr"omformer, and Linear Transformer, to reduce its GPU usage.
arXiv Detail & Related papers (2022-01-25T12:32:09Z)
Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer(ViT) models have demonstrated encouraging results across various computer vision tasks. We propose shunted self-attention(SSA) that allows ViTs to model the attentions at hybrid scales per attention layer. The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z)
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer. It significantly enhances the ViT of citedosovitskiy 2020image for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.