RaViTT: Random Vision Transformer Tokens
- URL: http://arxiv.org/abs/2306.10959v1
- Date: Mon, 19 Jun 2023 14:24:59 GMT
- Title: RaViTT: Random Vision Transformer Tokens
- Authors: Felipe A. Quezada, Carlos F. Navarro, Cristian Muñoz, Manuel
Zamorano, Jorge Jara-Wilde, Violeta Chang, Cristóbal A. Navarro, Mauricio
Cerda
- Abstract summary: Vision Transformers (ViTs) have successfully been applied to image classification problems where large annotated datasets are available.
We propose Random Vision Transformer Tokens (RaViTT), a random patch sampling strategy that can be incorporated into existing ViTs.
- Score: 0.41776442767736593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have successfully been applied to image
classification problems where large annotated datasets are available. On the
other hand, when fewer annotations are available, such as in biomedical
applications, image augmentation techniques like introducing image variations
or combinations have been proposed. However, regarding ViT patch sampling, less
has been explored outside grid-based strategies. In this work, we propose
Random Vision Transformer Tokens (RaViTT), a random patch sampling strategy
that can be incorporated into existing ViTs. We experimentally evaluated RaViTT
for image classification, comparing it with a baseline ViT and state-of-the-art
(SOTA) augmentation techniques in 4 datasets, including ImageNet-1k and
CIFAR-100. Results show that RaViTT increases the accuracy of the baseline in
all datasets and outperforms the SOTA augmentation techniques in 3 out of 4
datasets by a significant margin of +1.23% to +4.32%. Interestingly, RaViTT
accuracy improvements can be achieved even with fewer tokens, thus reducing the
computational load of any ViT model for a given accuracy value.
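The following is a minimal sketch of the random patch sampling idea described in the abstract, assuming a standard ViT patch-embedding pipeline in PyTorch. The function name `sample_random_patches`, its defaults, and the `num_tokens` parameter are illustrative assumptions, not the authors' released implementation; positional handling is left to the host ViT.
```python
# Illustrative sketch only: uniform random patch sampling in place of the
# usual non-overlapping grid. Names and defaults are assumptions, not the
# RaViTT reference code.
import torch

def sample_random_patches(images, patch_size=16, num_tokens=196):
    """Extract `num_tokens` patches at uniformly random positions.

    images: (B, C, H, W) tensor.
    Returns: (B, num_tokens, C * patch_size**2) flattened patches, ready for
    the ViT's existing linear token embedding.
    """
    B, C, H, W = images.shape
    # Random top-left corners, drawn independently per image and per token.
    ys = torch.randint(0, H - patch_size + 1, (B, num_tokens))
    xs = torch.randint(0, W - patch_size + 1, (B, num_tokens))

    patches = []
    for b in range(B):
        per_image = [
            images[b, :, ys[b, t]:ys[b, t] + patch_size,
                         xs[b, t]:xs[b, t] + patch_size]
            for t in range(num_tokens)
        ]
        patches.append(torch.stack(per_image))   # (num_tokens, C, p, p)
    patches = torch.stack(patches)               # (B, num_tokens, C, p, p)
    return patches.flatten(2)                    # (B, num_tokens, C * p * p)

# Because self-attention cost grows quadratically with the token count,
# sampling fewer tokens than the 14 x 14 = 196 grid positions (e.g. 128)
# directly reduces the compute of any downstream ViT block.
x = torch.randn(2, 3, 224, 224)
tokens = sample_random_patches(x, patch_size=16, num_tokens=128)
print(tokens.shape)  # torch.Size([2, 128, 768])
```
The reduced `num_tokens` in the usage above is meant to mimic the abstract's observation that accuracy gains can be obtained even with fewer tokens, which lowers the computational load for a given accuracy.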
Related papers
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves a much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning [50.95116994162883]
Vision transformers have been thought of as a promising alternative to convolutional neural networks for visual recognition.
This paper presents hierarchically cascaded transformers that exploit intrinsic image structures through spectral tokens pooling.
HCTransformers surpass the DINO baseline by a large margin of 9.7% in 5-way 1-shot accuracy and 9.17% in 5-way 5-shot accuracy on miniImageNet.
arXiv Detail & Related papers (2022-03-17T03:49:58Z) - ViR: the Vision Reservoir [10.881974985012839]
Vision Reservoir computing (ViR) is proposed here for image classification, as a parallel to the Vision Transformer (ViT).
By splitting each image into a sequence of tokens with fixed length, the ViR constructs a pure reservoir with a nearly fully connected topology to replace the Transformer module in ViT.
The number of parameters of the ViR is about 15%, or even as little as 5%, of that of the ViT, and its memory footprint is about 20% to 40% of the ViT's.
arXiv Detail & Related papers (2021-12-27T07:07:50Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds (a generic token-reduction sketch follows this list).
arXiv Detail & Related papers (2021-12-14T18:56:07Z) - A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformer (ViT) and its variants have achieved promising performance in various computer vision tasks.
We propose a unified framework for structural pruning of ViTs and their variants, namely UP-ViTs.
Our method focuses on pruning all ViT components while maintaining the consistency of the model structure.
arXiv Detail & Related papers (2021-11-30T05:01:02Z) - Discrete Representations Strengthen Vision Transformer Robustness [43.821734467553554]
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition.
We present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder.
Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks.
arXiv Detail & Related papers (2021-11-20T01:49:56Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
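For the AdaViT entry above, the following is a generic, heavily simplified sketch of progressive token reduction between transformer blocks, included only to make the cost-saving idea concrete. It is not AdaViT's actual halting mechanism; the norm-based scoring rule and the `TokenReducer` name are assumptions made for illustration.
```python
# Generic sketch of dropping tokens between transformer blocks so that deeper
# layers process fewer tokens. The keep-highest-norm rule is an illustrative
# stand-in, not AdaViT's learned halting scores.
import torch
import torch.nn as nn

class TokenReducer(nn.Module):
    def __init__(self, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio

    def forward(self, tokens):
        """tokens: (B, N, D) -> (B, int(N * keep_ratio), D)."""
        B, N, D = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        scores = tokens.norm(dim=-1)                 # (B, N) crude importance proxy
        idx = scores.topk(k, dim=1).indices          # indices of the kept tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, D)    # (B, k, D) for gather
        return tokens.gather(1, idx)

# Interleave reducers with standard encoder blocks: later (already expensive)
# layers then attend over progressively fewer tokens.
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    for _ in range(4)
)
reducers = nn.ModuleList(TokenReducer(0.7) for _ in range(4))

x = torch.randn(2, 196, 768)
for block, reducer in zip(blocks, reducers):
    x = reducer(block(x))
print(x.shape)  # torch.Size([2, 46, 768]) after four 0.7x reductions
```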
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.