Conviformers: Convolutionally guided Vision Transformer
- URL: http://arxiv.org/abs/2208.08900v1
- Date: Wed, 17 Aug 2022 13:09:24 GMT
- Title: Conviformers: Convolutionally guided Vision Transformer
- Authors: Mohit Vaishnav, Thomas Fel, Ivan Felipe Rodríguez and Thomas Serre
- Abstract summary: We present an in-depth analysis and describe the critical components for developing a system for the fine-grained categorization of plants from herbarium sheets.
We introduce a convolutional transformer architecture called Conviformer which, unlike the popular ConViT, can handle higher-resolution images without exploding memory and computational cost.
With our simple yet effective approach, we achieved SoTA on the Herbarium 202x and iNaturalist 2019 datasets.
- Score: 5.964436882344729
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision transformers are nowadays the de facto choice for image
classification tasks. There are two broad categories of classification tasks,
fine-grained and coarse-grained. In fine-grained classification, the necessity
is to discover subtle differences due to the high level of similarity between
sub-classes. Such distinctions are often lost as we downscale the image to save
the memory and computational cost associated with vision transformers (ViTs). In
this work, we present an in-depth analysis and describe the critical components
for developing a system for the fine-grained categorization of plants from
herbarium sheets. Our extensive experimental analysis indicated the need for a
better augmentation technique and the ability of modern-day neural networks to
handle higher-dimensional images. We also introduce a convolutional transformer
architecture called Conviformer which, unlike the popular ConViT, can handle
higher-resolution images without exploding memory and
computational cost. We also introduce a novel, improved pre-processing
technique called PreSizer to resize images better while preserving their
original aspect ratios, which proved essential for classifying natural plants.
With our simple yet effective approach, we achieved SoTA on the Herbarium 202x
and iNaturalist 2019 datasets.
Related papers
- Image Deblurring by Exploring In-depth Properties of Transformer [86.7039249037193]
We leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing performance on quantitative metrics.
By comparing transformer features between the recovered image and the target one, the pretrained transformer provides high-resolution, blur-sensitive semantic information.
One approach regards the features as vectors and computes the discrepancy between representations extracted from the recovered and target images in Euclidean space.
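As a sketch of such a ViT-feature objective (the paper's choice of layers and weighting is not given here), one can penalize the Euclidean distance between the token features of the restored image and the sharp target; `feature_extractor` stands for any frozen pretrained ViT that returns token features, e.g. a timm model's `forward_features`.

```python
import torch
import torch.nn.functional as F

def vit_perceptual_loss(feature_extractor, restored, target):
    """Euclidean discrepancy between pretrained-ViT token features of a
    restored image and its sharp target (illustrative sketch only).

    feature_extractor: frozen module mapping (B, 3, H, W) -> (B, N, D).
    """
    with torch.no_grad():                         # the ViT stays frozen
        feats_target = feature_extractor(target)
    feats_restored = feature_extractor(restored)  # grads reach the restorer
    return F.mse_loss(feats_restored, feats_target)
```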
arXiv Detail & Related papers (2023-03-24T14:14:25Z)
- Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that, for deeper models, the benefits more than offset the additional computational burden of recomputing activations.
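The underlying trick is the two-stream reversible (RevNet-style) coupling sketched below: block inputs are exactly recoverable from block outputs, so activations can be recomputed during the backward pass instead of cached. In a reversible ViT, `f` and `g` would be the attention and MLP sub-blocks; this sketch is generic, not the paper's exact module.

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Two-stream residual block whose inputs can be reconstructed from
    its outputs, removing the need to store activations for backprop."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f, self.g = f, g  # e.g. attention and MLP sub-blocks

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Exact inversion: this is what lets a reversible ViT trade
        # recomputation for activation memory.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```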
arXiv Detail & Related papers (2023-02-09T18:59:54Z)
- Optimizing Vision Transformers for Medical Image Segmentation and Few-Shot Domain Adaptation [11.690799827071606]
We propose Convolutional Swin-Unet (CS-Unet) transformer blocks and optimise their settings with respect to patch embedding, projection, the feed-forward network, up-sampling and skip connections.
CS-Unet can be trained from scratch and inherits the superiority of convolutions in each feature process phase.
Experiments show that CS-Unet without pre-training surpasses other state-of-the-art counterparts by large margins on two medical CT and MRI datasets with fewer parameters.
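To give a flavour of convolutions inside the embedding stage, here is a minimal convolutional patch embedding of the kind such hybrids use in place of ViT's single large-stride linear projection; the channel widths and strides are illustrative, not CS-Unet's actual settings.

```python
import torch.nn as nn

class ConvPatchEmbed(nn.Module):
    """Small conv stem producing a token sequence (illustrative only)."""

    def __init__(self, in_ch: int = 3, dim: int = 96):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, dim // 2, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim // 2, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.stem(x)                     # (B, dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, N, dim) tokens
```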
arXiv Detail & Related papers (2022-10-14T19:18:52Z)
- Class-Aware Generative Adversarial Transformers for Medical Image Segmentation [39.14169989603906]
We present CA-GANformer, a novel type of generative adversarial transformers, for medical image segmentation.
First, we take advantage of the pyramid structure to construct multi-scale representations and handle multi-scale variations.
We then design a novel class-aware transformer module to better learn the discriminative regions of objects with semantic structures.
arXiv Detail & Related papers (2022-01-26T03:50:02Z)
- Convolutional Xformers for Vision [2.7188347260210466]
Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks.
The reasons for their limited use include their need for larger training datasets and more computational resources compared to convolutional neural networks (CNNs).
We propose a linear attention-convolution hybrid architecture -- Convolutional X-formers for Vision (CXV) -- to overcome these limitations.
We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce GPU usage.
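A minimal sketch of one such mechanism, the Linear Transformer's kernelized attention with feature map phi(x) = elu(x) + 1 (Performer and Nyströmformer use different approximations): it never materializes the N x N attention matrix, so cost grows linearly with sequence length.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Kernelized attention, O(N) in sequence length N.

    q, k, v: (B, H, N, D) query/key/value tensors.
    """
    q, k = F.elu(q) + 1, F.elu(k) + 1           # positive feature map phi
    kv = torch.einsum("bhnd,bhne->bhde", k, v)  # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
```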
arXiv Detail & Related papers (2022-01-25T12:32:09Z)
- Restormer: Efficient Transformer for High-Resolution Image Restoration [118.9617735769827]
Convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data.
Transformers have shown significant performance gains on natural language and high-level vision tasks.
Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks.
arXiv Detail & Related papers (2021-11-18T18:59:10Z)
- Exploring Vision Transformers for Fine-grained Classification [0.0]
We propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions without requiring architectural changes.
We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology.
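To illustrate the general idea of attention-guided localization (a toy sketch only; the paper's multi-stage procedure is more involved), one can crop the bounding box of the patches the CLS token attends to most:

```python
import torch

def crop_from_cls_attention(img, attn_cls, patch: int = 16, keep: float = 0.5):
    """Crop the box covering the most-attended patches (toy sketch).

    img:      (3, H, W) image tensor
    attn_cls: (N,) CLS-token attention over the N patch tokens
    """
    _, H, W = img.shape
    gw = W // patch  # patches per row
    idx = attn_cls.topk(max(1, int(keep * attn_cls.numel()))).indices
    rows = torch.div(idx, gw, rounding_mode="floor")
    cols = idx % gw
    y0, y1 = rows.min() * patch, (rows.max() + 1) * patch
    x0, x1 = cols.min() * patch, (cols.max() + 1) * patch
    return img[:, y0:y1, x0:x1]
```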
arXiv Detail & Related papers (2021-06-19T23:57:31Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention layers have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
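One simplified, single-head way to see the claimed equivalence: all three operators mix patch tokens as a weighted sum and differ only in how the mixing matrix A is constrained (this reading is a sketch, not the paper's exact formulation).

```latex
y_i = \sum_{j} A_{ij}\, x_j W,
\qquad
A_{ij} =
\begin{cases}
\operatorname{softmax}_j\!\left(q_i^{\top} k_j / \sqrt{d}\right)
  & \text{self-attention: input-dependent} \\
a_{i-j}\,\mathbb{1}\!\left[\,\lvert i-j\rvert \le r\,\right]
  & \text{convolution: fixed, local, weight-shared} \\
a_{ij}
  & \text{fully connected: fixed, learned}
\end{cases}
```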
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Going deeper with Image Transformers [102.61950708108022]
We build and optimize deeper transformer networks for image classification.
We make two transformer architecture changes that significantly improve the accuracy of deep transformers.
Our best model establishes the new state of the art on ImageNet with Reassessed labels and ImageNet-V2 / match frequency.
arXiv Detail & Related papers (2021-03-31T17:37:32Z)
- Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
- Training Vision Transformers for Image Retrieval [32.09708181236154]
We adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective.
Our results show consistent and significant improvements of transformers over convolution-based approaches.
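As a sketch of such a metric-learning objective (the paper's exact loss and mining strategy may differ), an InfoNCE-style contrastive step over ViT descriptors looks like this; `vit` stands for any module mapping images to global descriptors.

```python
import torch
import torch.nn.functional as F

def contrastive_step(vit, anchors, positives, temperature: float = 0.07):
    """InfoNCE-style loss over ViT image descriptors (illustrative).

    vit: module mapping (B, 3, H, W) -> (B, D) global descriptors.
    anchors, positives: batches where row i of each depicts the same object.
    """
    za = F.normalize(vit(anchors), dim=-1)   # (B, D), unit-norm descriptors
    zp = F.normalize(vit(positives), dim=-1)
    logits = za @ zp.t() / temperature       # (B, B) cosine similarities
    labels = torch.arange(za.size(0), device=za.device)
    return F.cross_entropy(logits, labels)   # true pairs on the diagonal
```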
arXiv Detail & Related papers (2021-02-10T18:56:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.