Vision Xformers: Efficient Attention for Image Classification
- URL: http://arxiv.org/abs/2107.02239v1
- Date: Mon, 5 Jul 2021 19:24:23 GMT
- Title: Vision Xformers: Efficient Attention for Image Classification
- Authors: Pranav Jeevan, Amit Sethi (Indian Institute of Technology Bombay)
- Abstract summary: We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers.
We show that ViX performs better than ViT on image classification while consuming fewer computing resources.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linear attention mechanisms provide hope for overcoming the bottleneck of quadratic complexity that restricts the application of transformer models to vision tasks. We modify the ViT architecture to work on longer sequence data by replacing its quadratic attention with efficient transformers of linear complexity, such as the Performer, Linformer, and Nyströmformer, creating Vision X-formers (ViX). We show that ViX performs better than ViT on image classification while consuming fewer computing resources. We further show that replacing the linear embedding layer in ViX with convolutional layers increases performance. Our tests on recent vision transformer models such as LeViT and the Compact Convolutional Transformer (CCT) show that replacing their attention with the Nyströmformer or Performer reduces GPU usage and memory without degrading performance. Incorporating these changes can democratize transformers by making them accessible to those with limited data and computing resources.
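To make the attention swap concrete, the sketch below shows a linear-complexity multi-head attention module of the kind that could stand in for ViT's quadratic self-attention. For simplicity it uses the elu(x) + 1 kernel feature map of the Linear Transformer rather than the exact Performer, Linformer, or Nyströmformer approximations evaluated in the paper; the module name and layer sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearAttention(nn.Module):
    """Linear-complexity multi-head attention using the elu(x) + 1 kernel
    feature map (Linear Transformer style). Illustrative sketch only;
    Performer, Linformer, and Nystromformer use different approximations."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, head_dim)
        q, k, v = (t.view(b, n, h, d // h).transpose(1, 2) for t in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1          # positive feature maps
        # Associativity: compute K^T V first (head_dim x head_dim),
        # never materialising the (tokens x tokens) attention matrix.
        kv = torch.einsum('bhnd,bhne->bhde', k, v)
        norm = 1.0 / (torch.einsum('bhnd,bhd->bhn', q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum('bhnd,bhde,bhn->bhne', q, kv, norm)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)

# Usage: a drop-in replacement for standard self-attention in a ViT block.
tokens = torch.randn(2, 1024, 128)   # long sequence, e.g. 32x32 patch grid
attn = LinearAttention(dim=128, heads=8)
print(attn(tokens).shape)            # torch.Size([2, 1024, 128])
```

Because the kernelized key-value summary is only head_dim by head_dim rather than tokens by tokens, compute and memory grow linearly with sequence length, which is what allows ViX-style models (and the Convolutional X-formers listed below) to process longer patch sequences.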
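The second change, a convolutional tokenizer in place of the flatten-and-project patch embedding, can be sketched in the same spirit. The channel widths and strides below are hypothetical choices for illustration, not the configuration used in ViX or CCT.

```python
import torch
import torch.nn as nn

class ConvPatchEmbedding(nn.Module):
    """Convolutional tokenizer replacing ViT's single linear patch projection.
    Layer sizes here are hypothetical, for illustration only."""

    def __init__(self, in_channels: int = 3, dim: int = 128):
        super().__init__()
        # Two stride-2 convolutions downsample by 4 overall, i.e. the
        # equivalent of 4x4 patches, but with overlapping receptive fields.
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, dim // 2, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim // 2, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (batch, channels, H, W) -> tokens: (batch, H/4 * W/4, dim)
        x = self.proj(img)
        return x.flatten(2).transpose(1, 2)

# Usage: 32x32 CIFAR-style images become 8x8 = 64 tokens of width 128.
imgs = torch.randn(2, 3, 32, 32)
print(ConvPatchEmbedding()(imgs).shape)   # torch.Size([2, 64, 128])
```

The output keeps the usual (batch, tokens, dim) shape, so the rest of the transformer encoder is unchanged.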
Related papers
- Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models.
arXiv Detail & Related papers (2023-02-09T18:59:54Z)
- Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer [3.158346511479111]
We propose a simple but still effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs).
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly during the downstream training.
Our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step.
arXiv Detail & Related papers (2022-06-01T13:25:32Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Super Vision Transformer [131.4777773281238]
Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational costs of ViT models while even increasing performance.
Our SuperViT significantly outperforms existing studies on efficient vision transformers.
arXiv Detail & Related papers (2022-05-23T15:42:12Z)
- Convolutional Xformers for Vision [2.7188347260210466]
Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks.
The reasons for their limited use include their need for larger training datasets and more computational resources than convolutional neural networks (CNNs).
We propose a linear attention-convolution hybrid architecture -- Convolutional X-formers for Vision (CXV) -- to overcome these limitations.
We replace the quadratic attention with linear attention mechanisms, such as the Performer, Nyströmformer, and Linear Transformer, to reduce GPU usage.
arXiv Detail & Related papers (2022-01-25T12:32:09Z)
- Learned Queries for Efficient Local Attention [11.123272845092611]
The self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization.
We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner.
We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models.
arXiv Detail & Related papers (2021-12-21T18:52:33Z)
- Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.