MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
- URL: http://arxiv.org/abs/2307.02321v2
- Date: Thu, 7 Sep 2023 09:36:16 GMT
- Title: MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
- Authors: Jakob Drachmann Havtorn and Amelie Royer and Tijmen Blankevoort and
Babak Ehteshami Bejnordi
- Abstract summary: We introduce a conditional gating mechanism that selects the optimal token scale for every image region.
We show that our gating module is able to learn meaningful semantics despite operating locally at the coarse patch-level.
In contrast to token pruning, MSViT does not lose information about the input and can thus be readily applied to dense tasks.
- Score: 14.787864686489032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The input tokens to Vision Transformers carry little semantic meaning as they
are defined as regular equal-sized patches of the input image, regardless of
its content. However, processing uniform background areas of an image should
not necessitate as much compute as dense, cluttered areas. To address this
issue, we propose a dynamic mixed-scale tokenization scheme for ViT, MSViT. Our
method introduces a conditional gating mechanism that selects the optimal token
scale for every image region, such that the number of tokens is dynamically
determined per input. In addition, to enhance the conditional behavior of the
gate during training, we introduce a novel generalization of the batch-shaping
loss. We show that our gating module is able to learn meaningful semantics
despite operating locally at the coarse patch-level. The proposed gating module
is lightweight, agnostic to the choice of transformer backbone, and trained
within a few epochs with little training overhead. Furthermore, in contrast to
token pruning, MSViT does not lose information about the input and can thus be
readily applied to dense tasks. We validate MSViT on classification and
segmentation tasks, where it leads to an improved accuracy-complexity
trade-off.
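To make the mechanism concrete, below is a minimal, illustrative sketch of mixed-scale tokenization in PyTorch. The module names, patch sizes, and the hard threshold are assumptions for illustration, not the authors' implementation; in the paper the gate is trained end-to-end with a generalized batch-shaping loss, and the hard decision below only mimics inference-time behavior.

```python
import torch
import torch.nn as nn

class MixedScaleTokenizer(nn.Module):
    """Illustrative mixed-scale tokenizer: a lightweight gate scores each
    coarse patch and decides whether it stays a single coarse token or is
    re-tokenized into finer sub-patches. Shapes and the gate design are
    assumptions, not the MSViT reference implementation."""

    def __init__(self, embed_dim=192, coarse=32, fine=16, in_chans=3):
        super().__init__()
        assert coarse % fine == 0
        self.coarse_proj = nn.Conv2d(in_chans, embed_dim, kernel_size=coarse, stride=coarse)
        self.fine_proj = nn.Conv2d(in_chans, embed_dim, kernel_size=fine, stride=fine)
        self.gate = nn.Sequential(  # lightweight scorer operating on coarse patches
            nn.Linear(embed_dim, embed_dim // 4), nn.GELU(),
            nn.Linear(embed_dim // 4, 1),
        )
        self.ratio = coarse // fine  # fine sub-patches per coarse patch side

    def forward(self, x):
        # x: (B, C, H, W). Returns one token sequence per image, since the
        # number of tokens is dynamic per input.
        coarse_tok = self.coarse_proj(x).flatten(2).transpose(1, 2)   # (B, Nc, D)
        fine_tok = self.fine_proj(x)                                  # (B, D, Hf, Wf)
        B, D, Hf, Wf = fine_tok.shape
        Hc, Wc = Hf // self.ratio, Wf // self.ratio
        # Group fine tokens under their parent coarse patch: (B, Nc, r*r, D)
        fine_tok = fine_tok.view(B, D, Hc, self.ratio, Wc, self.ratio)
        fine_tok = fine_tok.permute(0, 2, 4, 3, 5, 1).reshape(B, Hc * Wc, -1, D)
        keep_coarse = self.gate(coarse_tok).squeeze(-1) > 0           # (B, Nc) hard decision
        tokens = []
        for b in range(B):
            toks = [coarse_tok[b, i:i + 1] if keep_coarse[b, i] else fine_tok[b, i]
                    for i in range(coarse_tok.shape[1])]
            tokens.append(torch.cat(toks, dim=0))                     # (N_b, D), N_b varies
        return tokens
```

Under this sketch, a uniform background region maps to a single coarse token while a cluttered region expands into several fine tokens, which is how the token count adapts to each input.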
Related papers
- SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers [0.0]
We introduce the Scale-Aware Graph Attention Vision Transformer (SAG-ViT), a novel framework that addresses this challenge by integrating multi-scale features.
Using EfficientNet as a backbone, the model extracts multi-scale feature maps, which are divided into patches to preserve semantic information.
The SAG-ViT is evaluated on benchmark datasets, demonstrating its effectiveness in enhancing image classification performance.
arXiv Detail & Related papers (2024-11-14T13:15:27Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields [7.58745191859815]
Vision transformers (ViTs) that model an image as a sequence of partitioned patches have shown notable performance in diverse vision tasks.
We propose explicitly adding a Gaussian attention bias that guides the positional embedding to have the corresponding pattern from the beginning of training.
The results show that the proposed method not only helps ViTs understand images but also boosts their performance on various datasets.
arXiv Detail & Related papers (2023-05-08T14:12:25Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have been shown to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z)
- Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel but surprisingly simple alternative to content reconstruction: predicting locations from content, without providing positional information.
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods.
arXiv Detail & Related papers (2022-07-15T17:10:48Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit dedicated to covering the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition [26.090419694326823]
Localization and amplification of region attention are important factors that have been explored extensively by convolutional neural network (CNN)-based approaches.
We propose the recurrent attention multi-scale transformer (RAMS-Trans) which uses the transformer's self-attention to learn discriminative region attention.
arXiv Detail & Related papers (2021-07-17T06:22:20Z)
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%.
DynamicViT models achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet (a schematic sketch of this style of input-dependent token pruning follows this list).
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the observation that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
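As a point of contrast with the token-pruning line of work cited above (see the DynamicViT entry), the sketch below shows input-dependent token pruning in its simplest form: a small MLP scores the patch tokens and only the top-scoring fraction is kept. This is an assumed, schematic version, not the DynamicViT implementation (which trains the scorer end-to-end with differentiable masks); it is included to highlight that pruning discards tokens outright, whereas MSViT keeps a coarse token for low-detail regions and therefore remains usable for dense prediction.

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Illustrative input-dependent token pruning: a small MLP scores each
    patch token and only the top-scoring fraction is kept. Schematic
    inference-time sketch, not the DynamicViT implementation."""

    def __init__(self, dim=192, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim // 4),
                                    nn.GELU(), nn.Linear(dim // 4, 1))

    def forward(self, tokens):
        # tokens: (B, N, D); the class token (index 0) is always kept.
        cls_tok, patch_tok = tokens[:, :1], tokens[:, 1:]
        scores = self.scorer(patch_tok).squeeze(-1)                    # (B, N-1)
        n_keep = max(1, int(self.keep_ratio * patch_tok.shape[1]))
        idx = scores.topk(n_keep, dim=1).indices                       # (B, n_keep)
        idx = idx.unsqueeze(-1).expand(-1, -1, patch_tok.shape[-1])
        kept = patch_tok.gather(1, idx)                                # (B, n_keep, D)
        return torch.cat([cls_tok, kept], dim=1)

# Example: prune a sequence of 197 tokens (1 class + 196 patch tokens).
pruned = TokenPruner()(torch.randn(2, 197, 192))  # -> (2, 138, 192)
```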
This list is automatically generated from the titles and abstracts of the papers on this site.