Vision Transformers with Mixed-Resolution Tokenization
- URL: http://arxiv.org/abs/2304.00287v2
- Date: Thu, 27 Apr 2023 13:16:38 GMT
- Title: Vision Transformers with Mixed-Resolution Tokenization
- Authors: Tomer Ronen, Omer Levy, Avram Golbert
- Abstract summary: Vision Transformer models process input images by dividing them into a spatially regular grid of equal-size patches.
We introduce a novel image tokenization scheme, replacing the standard uniform grid with a mixed-resolution sequence of tokens.
Using the Quadtree algorithm and a novel saliency scorer, we construct a patch mosaic where low-saliency areas of the image are processed in low resolution.
- Score: 34.18534105043819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer models process input images by dividing them into a
spatially regular grid of equal-size patches. Conversely, Transformers were
originally introduced over natural language sequences, where each token
represents a subword - a chunk of raw data of arbitrary size. In this work, we
apply this approach to Vision Transformers by introducing a novel image
tokenization scheme, replacing the standard uniform grid with a
mixed-resolution sequence of tokens, where each token represents a patch of
arbitrary size. Using the Quadtree algorithm and a novel saliency scorer, we
construct a patch mosaic where low-saliency areas of the image are processed in
low resolution, routing more of the model's capacity to important image
regions. Using the same architecture as vanilla ViTs, our Quadformer models
achieve substantial accuracy gains on image classification when controlling for
the computational budget. Code and models are publicly available at
https://github.com/TomerRonen34/mixed-resolution-vit.
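To make the tokenization scheme concrete, the sketch below illustrates the general idea in Python: a greedy quadtree split driven by a saliency score produces a mixed-resolution patch mosaic, and every patch is average-pooled down to the same token size so low-saliency regions end up represented in low resolution. This is a minimal illustration, not the authors' implementation (their code is in the repository above); the pixel-variance saliency, the greedy budgeted split, and the names and defaults (quadtree_tokenize, patches_to_tokens, 256x256 input, 196-token budget) are simplifying assumptions.

```python
import numpy as np

def saliency(region: np.ndarray) -> float:
    # Toy stand-in for the paper's saliency scorer: pixel variance of the region.
    return float(region.var())

def quadtree_tokenize(image: np.ndarray, num_tokens: int, min_patch: int = 16):
    # Greedy quadtree split: repeatedly split the most salient patch into 4
    # quadrants until the token budget is reached. Returns a list of
    # (y, x, size) squares covering the image -- the "patch mosaic".
    h, w, _ = image.shape
    assert h == w, "this sketch assumes square images (e.g. 256x256)"
    patches = [(0, 0, h)]             # one patch covering the whole image
    while len(patches) < num_tokens:  # each split replaces 1 patch with 4 (+3 tokens)
        splittable = [p for p in patches if p[2] > min_patch]
        if not splittable:
            break
        y, x, s = max(
            splittable,
            key=lambda p: saliency(image[p[0]:p[0] + p[2], p[1]:p[1] + p[2]]),
        )
        patches.remove((y, x, s))
        half = s // 2
        patches += [(y, x, half), (y, x + half, half),
                    (y + half, x, half), (y + half, x + half, half)]
    return patches

def patches_to_tokens(image: np.ndarray, patches, token_size: int = 16):
    # Average-pool every patch down to token_size x token_size, so large
    # (low-saliency) patches are processed in low resolution while all
    # tokens share the same dimensionality.
    tokens = []
    for y, x, s in patches:
        region = image[y:y + s, x:x + s]
        f = s // token_size  # integer pooling factor (s is a multiple of token_size)
        pooled = region.reshape(token_size, f, token_size, f, -1).mean(axis=(1, 3))
        tokens.append(pooled.reshape(-1))
    return np.stack(tokens)  # (num_tokens, token_size * token_size * channels)

# Toy usage: a 256x256 RGB image and a 196-token budget.
img = np.random.rand(256, 256, 3).astype(np.float32)
mosaic = quadtree_tokenize(img, num_tokens=196)
tokens = patches_to_tokens(img, mosaic)
print(tokens.shape)  # (196, 768)
```

Because all tokens have the same dimensionality, the resulting sequence can be fed to a standard ViT encoder once positional information for each patch is added.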
Related papers
- Patch Is Not All You Need [57.290256181083016]
We propose a novel Pattern Transformer to adaptively convert images to pattern sequences for Transformer input.
We employ a Convolutional Neural Network to extract various patterns from the input image.
We have accomplished state-of-the-art performance on CIFAR-10 and CIFAR-100, and have achieved competitive results on ImageNet.
arXiv Detail & Related papers (2023-08-21T13:54:00Z)
- Cascaded Cross-Attention Networks for Data-Efficient Whole-Slide Image Classification Using Transformers [0.11219061154635457]
Whole-Slide Imaging allows for the capturing and digitization of high-resolution images of histological specimens.
The transformer architecture has been proposed as a possible candidate for effectively leveraging this high-resolution information.
We propose a novel cascaded cross-attention network (CCAN) based on the cross-attention mechanism that scales linearly with the number of extracted patches.
arXiv Detail & Related papers (2023-05-11T16:42:24Z)
- Vision Transformer Based Model for Describing a Set of Images as a Story [26.717033245063092]
We propose a novel Vision Transformer-based model for describing a set of images as a story.
The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT).
The performance of our proposed model is evaluated using the Visual Story-Telling dataset (VIST).
arXiv Detail & Related papers (2022-10-06T09:01:50Z)
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
- HIPA: Hierarchical Patch Transformer for Single Image Super Resolution [62.7081074931892]
This paper presents HIPA, a novel Transformer architecture that progressively recovers the high resolution image using a hierarchical patch partition.
We build a cascaded model that processes an input image in multiple stages, starting with tokens of small patch size and gradually merging them up to the full resolution.
Such a hierarchical patch mechanism not only explicitly enables feature aggregation at multiple resolutions but also adaptively learns patch-aware features for different image regions.
arXiv Detail & Related papers (2022-03-19T05:09:34Z)
- Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)
- XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens and allows efficient processing of high-resolution images; a minimal sketch of this channel-wise attention appears after this list.
arXiv Detail & Related papers (2021-06-17T17:33:35Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity, rather than the quadratic complexity of standard attention.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision [67.55770209540306]
The Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context.
Using an advanced training recipe, our VTs significantly outperform their convolutional counterparts.
For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPN) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
arXiv Detail & Related papers (2020-06-05T20:49:49Z)
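For reference, the XCiT entry above describes attention applied across feature channels rather than tokens. Below is a minimal PyTorch sketch of such cross-covariance attention, written from the description in that summary (l2-normalized queries and keys, a per-head channel-by-channel attention map with a learnable temperature); it is an illustrative sketch under those assumptions, not the XCiT authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossCovarianceAttention(nn.Module):
    # Attention over the (head_dim x head_dim) channel cross-covariance of Q and K,
    # so cost is linear in the number of tokens N (quadratic only in channel width).
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))  # learnable scale
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)   # each: (B, heads, head_dim, N)
        q = F.normalize(q, dim=-1)             # l2-normalize along the token axis
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, head_dim, head_dim)
        attn = attn.softmax(dim=-1)
        out = attn @ v                          # (B, heads, head_dim, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)

# Toy usage: 1024 tokens of dimension 192.
x = torch.randn(2, 1024, 192)
print(CrossCovarianceAttention(dim=192, num_heads=8)(x).shape)  # torch.Size([2, 1024, 192])
```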