FlexiViT: One Model for All Patch Sizes
- URL: http://arxiv.org/abs/2212.08013v2
- Date: Thu, 23 Mar 2023 21:38:16 GMT
- Title: FlexiViT: One Model for All Patch Sizes
- Authors: Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron,
Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim
Alabdulmohsin, Filip Pavetic
- Abstract summary: Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.
- Score: 100.52574011880571
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller
patches leading to higher accuracy at greater computational cost, but changing
the patch size typically requires retraining the model. In this paper, we
demonstrate that simply randomizing the patch size at training time leads to a
single set of weights that performs well across a wide range of patch sizes,
making it possible to tailor the model to different compute budgets at
deployment time. We extensively evaluate the resulting model, which we call
FlexiViT, on a wide range of tasks, including classification, image-text
retrieval, open-world detection, panoptic segmentation, and semantic
segmentation, concluding that it usually matches, and sometimes outperforms,
standard ViT models trained at a single patch size in an otherwise identical
setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that
makes it easy to add compute-adaptive capabilities to most models relying on a
ViT backbone architecture. Code and pre-trained models are available at
https://github.com/google-research/big_vision
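The recipe is simple enough to sketch. The fragment below is a minimal, hypothetical PyTorch illustration of the idea described in the abstract: sample a patch size at random at each training step and resize the patch-embedding filters so one set of underlying weights serves every patch size. The concrete patch-size grid and function names are assumptions; the paper additionally resizes position embeddings and uses a pseudo-inverse resize rather than plain interpolation, and the official code is in the big_vision repository linked above.
```python
import random

import torch
import torch.nn.functional as F

# Example grid of patch sizes to sample from; the concrete set is an assumption.
PATCH_SIZES = [8, 10, 12, 15, 16, 20, 24, 30, 40, 48]


def resize_patch_embed(weight: torch.Tensor, new_size: int) -> torch.Tensor:
    """Resize patch-embedding filters (dim, C, p, p) -> (dim, C, new_size, new_size).

    Plain bilinear interpolation keeps the sketch short; the paper proposes a
    pseudo-inverse ("PI") resize for better weight reuse.
    """
    return F.interpolate(weight, size=(new_size, new_size),
                         mode="bilinear", align_corners=False)


def flexible_patchify(images: torch.Tensor, weight: torch.Tensor,
                      patch_size: int) -> torch.Tensor:
    """Embed images (B, C, H, W) into patch tokens (B, N, dim) at a chosen patch size."""
    w = resize_patch_embed(weight, patch_size)
    # A convolution whose stride equals its kernel size is exactly
    # non-overlapping patch extraction followed by a linear embedding.
    tokens = F.conv2d(images, w, stride=patch_size)   # (B, dim, H/p, W/p)
    return tokens.flatten(2).transpose(1, 2)


# Training-loop fragment: one randomly sampled patch size per step.
# for images, labels in loader:
#     p = random.choice(PATCH_SIZES)
#     tokens = flexible_patchify(images, patch_embed_weight, p)
#     ...add resized position embeddings and run the standard ViT encoder...
```
At deployment time the same weights can then be evaluated at whichever patch size fits the compute budget.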
Related papers
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
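As a rough illustration of the nesting idea (a sketch under assumed shapes, not MatFormer's actual code), a smaller sub-model can reuse the leading slice of the full model's FFN weights, so several widths share one checkpoint:
```python
import torch
import torch.nn.functional as F


class NestedFFN(torch.nn.Module):
    """Toy feed-forward block whose hidden width can be truncated at inference."""

    def __init__(self, d_model: int = 512, d_ff_full: int = 2048):
        super().__init__()
        self.w_in = torch.nn.Parameter(torch.randn(d_ff_full, d_model) * 0.02)
        self.w_out = torch.nn.Parameter(torch.randn(d_model, d_ff_full) * 0.02)

    def forward(self, x: torch.Tensor, d_ff: int) -> torch.Tensor:
        # The "small" model is a prefix of the "large" one: only the first
        # d_ff hidden units are used, so no separate weights are stored.
        h = F.gelu(F.linear(x, self.w_in[:d_ff]))
        return F.linear(h, self.w_out[:, :d_ff])
```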
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z)
- Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective.
However, customized algorithms (e.g., GreenMIM) must be carefully designed for hierarchical ViTs, rather than reusing the vanilla, simple MAE recipe for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- Accelerating Vision Transformer Training via a Patch Sampling Schedule [0.685316573653194]
We introduce the notion of a Patch Sampling Schedule (PSS), which varies the number of Vision Transformer (ViT) patches used per batch during training.
We observe that training with a PSS makes a ViT more robust to a wider patch sampling range during inference.
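A patch sampling schedule can be illustrated in a few lines; the linear ramp below is an assumed example, not the schedule used in the paper:
```python
def patch_keep_ratio(step: int, total_steps: int,
                     start: float = 0.3, end: float = 1.0) -> float:
    """Fraction of ViT patches to keep at a given training step (linear ramp)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + frac * (end - start)
```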
arXiv Detail & Related papers (2022-08-19T19:16:46Z)
- PatchDropout: Economizing Vision Transformers Using Patch Dropout [9.243684409949436]
We show that standard ViT models can be efficiently trained at high resolution by randomly dropping input image patches.
We observe a 5 times savings in computation and memory using PatchDropout, along with a boost in performance.
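A minimal sketch of the patch-dropping step described above (tensor names and the keep ratio are assumptions, not the paper's implementation):
```python
import torch


def drop_patches(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a subset of patch tokens per image.

    tokens: (B, N, D) patch embeddings, class token excluded. Returns (B, K, D).
    """
    b, n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    # Independent random permutation per image; keep the first k positions.
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :k]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
```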
arXiv Detail & Related papers (2022-08-10T14:08:55Z)
- Patches Are All You Need? [96.88889685873106]
Vision Transformer (ViT) models may exceed the performance of convolutional networks in some settings.
ViTs require the use of patch embeddings, which group together small regions of the image into single input features.
This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation?
arXiv Detail & Related papers (2022-01-24T16:42:56Z)
- So-ViT: Mind Visual Tokens for Vision Transformer [27.243241133304785]
We propose a new classification paradigm, where the second-order, cross-covariance pooling of visual tokens is combined with class token for final classification.
We develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
The results show our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models.
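A hedged sketch of the pooling step (dimensions and the way the statistic is combined with the class token are illustrative assumptions, not So-ViT's exact module):
```python
import torch


def second_order_pool(tokens: torch.Tensor) -> torch.Tensor:
    """Covariance-style second-order statistic of visual tokens.

    tokens: (B, N, D) visual tokens (class token excluded) -> (B, D * D).
    In practice the token dimension would be reduced first to keep D * D small.
    """
    centered = tokens - tokens.mean(dim=1, keepdim=True)
    cov = torch.einsum("bnd,bne->bde", centered, centered) / (tokens.size(1) - 1)
    return cov.flatten(1)


# The pooled statistic is then combined with the class token for the final
# prediction, e.g. logits = head(torch.cat([cls_tok, proj(pooled)], dim=-1)).
```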
arXiv Detail & Related papers (2021-04-22T09:05:09Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
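The token-extraction step can be sketched as a 3D convolution whose stride equals its kernel size; the tubelet and embedding sizes below are assumptions, not ViViT's published configuration:
```python
import torch


def tubelet_tokens(video: torch.Tensor, dim: int = 768,
                   t: int = 2, p: int = 16) -> torch.Tensor:
    """Extract spatio-temporal tokens from a clip of shape (B, C, T, H, W)."""
    # In a real model the Conv3d would be a module attribute with learned weights;
    # it is constructed inline here only to keep the sketch self-contained.
    embed = torch.nn.Conv3d(video.size(1), dim, kernel_size=(t, p, p),
                            stride=(t, p, p))
    tokens = embed(video)                     # (B, dim, T/t, H/p, W/p)
    return tokens.flatten(2).transpose(1, 2)  # (B, (T/t)*(H/p)*(W/p), dim)
```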
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training [12.36664837965624]
This paper presents an approach to automatically shard the weight update across replicas.
We show this technique achieves substantial speedups on typical image and language models on Cloud TPUs.
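The idea can be simulated in a single process (an illustrative sketch with plain SGD, not the paper's XLA-level implementation): each replica applies the optimizer step to only its slice of the parameters, and the slices are then exchanged.
```python
import torch


def sharded_sgd_update(flat_params: torch.Tensor, flat_grad: torch.Tensor,
                       num_replicas: int, lr: float = 0.1) -> torch.Tensor:
    """Apply the weight update shard-by-shard, as if each replica owned one shard."""
    param_shards = flat_params.chunk(num_replicas)
    grad_shards = flat_grad.chunk(num_replicas)
    # Each "replica" updates only its own shard; in a real data-parallel job the
    # updated shards would then be all-gathered back onto every replica.
    updated = [p - lr * g for p, g in zip(param_shards, grad_shards)]
    return torch.cat(updated)
```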
arXiv Detail & Related papers (2020-04-28T07:13:50Z)