Convolutional Xformers for Vision
- URL: http://arxiv.org/abs/2201.10271v1
- Date: Tue, 25 Jan 2022 12:32:09 GMT
- Title: Convolutional Xformers for Vision
- Authors: Pranav Jeevan and Amit Sethi
- Abstract summary: Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks.
The reasons for their limited use include their need for larger training datasets and more computational resources compared to convolutional neural networks (CNNs).
We propose a linear attention-convolution hybrid architecture -- Convolutional X-formers for Vision (CXV) -- to overcome these limitations.
We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce GPU usage.
- Score: 2.7188347260210466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers (ViTs) have found only limited practical use in
processing images, in spite of their state-of-the-art accuracy on certain
benchmarks. The reasons for their limited use include their need for larger
training datasets and more computational resources compared to convolutional
neural networks (CNNs), owing to the quadratic complexity of their
self-attention mechanism. We propose a linear attention-convolution hybrid
architecture -- Convolutional X-formers for Vision (CXV) -- to overcome these
limitations. We replace the quadratic attention with linear attention
mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to
reduce GPU usage. An inductive prior for image data is provided by convolutional sub-layers, thereby eliminating the need for the class token and positional embeddings used by ViTs. We also propose a new training method
where we use two different optimizers during different phases of training and
show that it improves the top-1 image classification accuracy across different
architectures. CXV outperforms other architectures, including token mixers (e.g., ConvMixer, FNet, and MLP Mixer), transformer models (e.g., ViT, CCT, CvT, and hybrid Xformers), and ResNets, for image classification in scenarios with limited data and GPU resources (cores, RAM, power).
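To make the core ideas concrete, here is a minimal sketch, assuming a PyTorch-style implementation (not the authors' released code): a depthwise convolutional sub-layer supplies the spatial inductive prior, so no class token or positional embeddings are used, and the quadratic softmax attention is swapped for a kernelized linear attention in the style of the Linear Transformer. The module names, dimensions, and the ELU+1 feature map are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch, assuming a PyTorch-style implementation; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """Kernelized attention in the style of the Linear Transformer: O(N) in tokens."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (batch, tokens, dim)
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Split heads -> (batch, heads, tokens, dim_head).
        q, k, v = (t.view(b, n, h, d // h).transpose(1, 2) for t in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1                    # positive feature map phi(.)
        kv = torch.einsum("bhnd,bhne->bhde", k, v)           # (dim_head x dim_head): linear in n
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


class ConvLinearAttentionBlock(nn.Module):
    """Convolutional sub-layer (spatial inductive prior) followed by linear attention."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(                           # depthwise 3x3 conv: locality prior,
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),   # so no positional embeddings needed
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )
        self.norm1 = nn.LayerNorm(dim)
        self.attn = LinearAttention(dim, heads)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                                    # x: (batch, dim, H, W)
        x = x + self.conv(x)
        b, c, ht, wd = x.shape
        t = x.flatten(2).transpose(1, 2)                     # (batch, tokens, dim); no class token
        t = t + self.attn(self.norm1(t))
        t = t + self.mlp(self.norm2(t))
        return t.transpose(1, 2).reshape(b, c, ht, wd)
```

The two-optimizer training method can likewise be sketched as a simple phase switch; the specific optimizers (AdamW then SGD) and the switch epoch below are assumptions, since the abstract does not name them.

```python
# Sketch of the two-phase training idea: the optimizer choices and switch epoch
# are assumptions, not details taken from the abstract.
model = ConvLinearAttentionBlock(dim=64)
phase1 = torch.optim.AdamW(model.parameters(), lr=1e-3)
phase2 = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
x = torch.randn(8, 64, 32, 32)                               # dummy batch, illustration only
for epoch in range(100):
    optimizer = phase1 if epoch < 70 else phase2             # assumed switch point
    optimizer.zero_grad()
    loss = model(x).mean()                                   # stand-in for the classification loss
    loss.backward()
    optimizer.step()
```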
Related papers
- Image-GS: Content-Adaptive Image Representation via 2D Gaussians [55.15950594752051]
We propose Image-GS, a content-adaptive image representation.
Using anisotropic 2D Gaussians as the basis, Image-GS shows high memory efficiency, supports fast random access, and offers a natural level of detail stack.
The general efficiency and fidelity of Image-GS are validated against several recent neural image representations and industry-standard texture compressors.
We hope this research offers insights for developing new applications that require adaptive quality and resource control, such as machine perception, asset streaming, and content generation.
arXiv Detail & Related papers (2024-07-02T00:45:21Z)
- TiC: Exploring Vision Transformer in Convolution [37.50285921899263]
We propose the Multi-Head Self-Attention Convolution (MSA-Conv).
MSA-Conv incorporates Self-Attention within generalized convolutions, including standard, dilated, and depthwise ones.
We present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv.
arXiv Detail & Related papers (2023-10-06T10:16:26Z)
- TEC-Net: Vision Transformer Embrace Convolutional Neural Networks for Medical Image Segmentation [20.976167468217387]
We propose TEC-Net, a vision Transformer that embraces convolutional neural networks for medical image segmentation.
Our network has two advantages. First, a dynamic deformable convolution (DDConv) is designed in the CNN branch; it not only overcomes the difficulty of adaptive feature extraction with fixed-size convolution kernels but also addresses the problem that different inputs share the same convolution kernel parameters.
Experimental results show that the proposed TEC-Net provides better medical image segmentation results than state-of-the-art (SOTA) methods, including CNN and Transformer networks.
arXiv Detail & Related papers (2023-06-07T01:14:16Z)
- Optimizing Vision Transformers for Medical Image Segmentation and Few-Shot Domain Adaptation [11.690799827071606]
We propose Convolutional Swin-Unet (CS-Unet) transformer blocks and optimise their settings with respect to patch embedding, projection, the feed-forward network, up-sampling, and skip connections.
CS-Unet can be trained from scratch and inherits the superiority of convolutions in each feature process phase.
Experiments show that CS-Unet without pre-training surpasses other state-of-the-art counterparts by large margins on two medical CT and MRI datasets with fewer parameters.
arXiv Detail & Related papers (2022-10-14T19:18:52Z)
- Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks, such as ImageNet-1K, show that our ASF-former outperforms its CNN, transformer counterparts, and hybrid pilots in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Vision Xformers: Efficient Attention for Image Classification [0.0]
We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers.
We show that ViX performs better than ViT in image classification while consuming fewer computing resources.
arXiv Detail & Related papers (2021-07-05T19:24:23Z)
- Content-Aware Convolutional Neural Networks [98.97634685964819]
Convolutional Neural Networks (CNNs) have achieved great success due to the powerful feature learning ability of convolution layers.
We propose a Content-aware Convolution (CAC) that automatically detects the smooth windows and applies a 1x1 convolutional kernel to replace the original large kernel.
arXiv Detail & Related papers (2021-06-30T03:54:35Z)
- XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images.
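As a rough illustration of the mechanism summarized above, here is a hedged sketch of cross-covariance attention, not the XCiT reference implementation: the softmax runs over a channel-by-channel matrix rather than a token-by-token one, so the cost grows linearly with the number of tokens. The normalization axis, head handling, and temperature shape are assumptions based on the abstract.

```python
# A hedged sketch of cross-covariance attention; details (normalization axis,
# temperature shape) are assumptions, not the XCiT reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossCovarianceAttention(nn.Module):
    """Attention over channel-channel (dim_head x dim_head) pairs: linear in token count."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))  # learnable per-head scale
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (batch, tokens, dim)
        b, n, d = x.shape
        h = self.heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Rearrange so attention acts on the channel axis: (batch, heads, dim_head, tokens).
        q, k, v = (t.view(b, n, h, d // h).permute(0, 2, 3, 1) for t in (q, k, v))
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)    # L2-normalise along tokens
        attn = (q @ k.transpose(-2, -1)) * self.temperature      # (b, h, dim_head, dim_head)
        attn = attn.softmax(dim=-1)
        out = attn @ v                                           # (b, h, dim_head, tokens)
        out = out.permute(0, 3, 1, 2).reshape(b, n, d)
        return self.proj(out)
```

For example, a 196-token, 64-dimensional input with 4 heads yields a 16x16 attention map per head, independent of the token count.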
arXiv Detail & Related papers (2021-06-17T17:33:35Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity, rather than quadratic.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.