FlexiViT: One Model for All Patch Sizes
- URL: http://arxiv.org/abs/2212.08013v2
- Date: Thu, 23 Mar 2023 21:38:16 GMT
- Title: FlexiViT: One Model for All Patch Sizes
- Authors: Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron,
Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim
Alabdulmohsin, Filip Pavetic
- Abstract summary: Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.
- Score: 100.52574011880571
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller
patches leading to higher accuracy at greater computational cost, but changing
the patch size typically requires retraining the model. In this paper, we
demonstrate that simply randomizing the patch size at training time leads to a
single set of weights that performs well across a wide range of patch sizes,
making it possible to tailor the model to different compute budgets at
deployment time. We extensively evaluate the resulting model, which we call
FlexiViT, on a wide range of tasks, including classification, image-text
retrieval, open-world detection, panoptic segmentation, and semantic
segmentation, concluding that it usually matches, and sometimes outperforms,
standard ViT models trained at a single patch size in an otherwise identical
setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that
makes it easy to add compute-adaptive capabilities to most models relying on a
ViT backbone architecture. Code and pre-trained models are available at
https://github.com/google-research/big_vision
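The recipe is simple enough to sketch. The fragment below is a minimal, hypothetical PyTorch illustration of the idea described in the abstract: sample a patch size at random at each training step and resize the patch-embedding filters so one set of underlying weights serves every patch size. The concrete patch-size grid and function names are assumptions; the paper additionally resizes position embeddings and uses a pseudo-inverse resize rather than plain interpolation, and the official code is in the big_vision repository linked above.
```python
import random

import torch
import torch.nn.functional as F

# Example grid of patch sizes to sample from; the concrete set is an assumption.
PATCH_SIZES = [8, 10, 12, 15, 16, 20, 24, 30, 40, 48]


def resize_patch_embed(weight: torch.Tensor, new_size: int) -> torch.Tensor:
    """Resize patch-embedding filters (dim, C, p, p) -> (dim, C, new_size, new_size).

    Plain bilinear interpolation keeps the sketch short; the paper proposes a
    pseudo-inverse ("PI") resize for better weight reuse.
    """
    return F.interpolate(weight, size=(new_size, new_size),
                         mode="bilinear", align_corners=False)


def flexible_patchify(images: torch.Tensor, weight: torch.Tensor,
                      patch_size: int) -> torch.Tensor:
    """Embed images (B, C, H, W) into patch tokens (B, N, dim) at a chosen patch size."""
    w = resize_patch_embed(weight, patch_size)
    # A convolution whose stride equals its kernel size is exactly
    # non-overlapping patch extraction followed by a linear embedding.
    tokens = F.conv2d(images, w, stride=patch_size)   # (B, dim, H/p, W/p)
    return tokens.flatten(2).transpose(1, 2)


# Training-loop fragment: one randomly sampled patch size per step.
# for images, labels in loader:
#     p = random.choice(PATCH_SIZES)
#     tokens = flexible_patchify(images, patch_embed_weight, p)
#     ...add resized position embeddings and run the standard ViT encoder...
```
At deployment time the same weights can then be evaluated at whichever patch size fits the compute budget.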
Related papers
- MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
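As a rough illustration of the nesting idea (a sketch under assumed shapes, not MatFormer's actual code), a smaller sub-model can reuse the leading slice of the full model's FFN weights, so several widths share one checkpoint:
```python
import torch
import torch.nn.functional as F


class NestedFFN(torch.nn.Module):
    """Toy feed-forward block whose hidden width can be truncated at inference."""

    def __init__(self, d_model: int = 512, d_ff_full: int = 2048):
        super().__init__()
        self.w_in = torch.nn.Parameter(torch.randn(d_ff_full, d_model) * 0.02)
        self.w_out = torch.nn.Parameter(torch.randn(d_model, d_ff_full) * 0.02)

    def forward(self, x: torch.Tensor, d_ff: int) -> torch.Tensor:
        # The "small" model is a prefix of the "large" one: only the first
        # d_ff hidden units are used, so no separate weights are stored.
        h = F.gelu(F.linear(x, self.w_in[:d_ff]))
        return F.linear(h, self.w_out[:, :d_ff])
```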
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z)
- Rethinking Hierarchicies in Pre-trained Plain Vision Transformer [76.35955924137986]
Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective.
However, customized algorithms (e.g., GreenMIM) must be carefully designed for hierarchical ViTs, rather than reusing the vanilla, simple MAE recipe for the plain ViT.
This paper proposes a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training.
arXiv Detail & Related papers (2022-11-03T13:19:23Z)
- Accelerating Vision Transformer Training via a Patch Sampling Schedule [0.685316573653194]
We introduce the notion of a Patch Sampling Schedule (PSS), which varies the number of Vision Transformer (ViT) patches used per batch during training.
We observe that training with a PSS makes a ViT more robust to a wider patch sampling range during inference.
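A patch sampling schedule can be illustrated in a few lines; the linear ramp below is an assumed example, not the schedule used in the paper:
```python
def patch_keep_ratio(step: int, total_steps: int,
                     start: float = 0.3, end: float = 1.0) -> float:
    """Fraction of ViT patches to keep at a given training step (linear ramp)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + frac * (end - start)
```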
arXiv Detail & Related papers (2022-08-19T19:16:46Z)
- PatchDropout: Economizing Vision Transformers Using Patch Dropout [9.243684409949436]
We show that standard ViT models can be efficiently trained at high resolution by randomly dropping input image patches.
We observe a 5 times savings in computation and memory using PatchDropout, along with a boost in performance.
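A minimal sketch of the patch-dropping step described above (tensor names and the keep ratio are assumptions, not the paper's implementation):
```python
import torch


def drop_patches(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a subset of patch tokens per image.

    tokens: (B, N, D) patch embeddings, class token excluded. Returns (B, K, D).
    """
    b, n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    # Independent random permutation per image; keep the first k positions.
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :k]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
```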
arXiv Detail & Related papers (2022-08-10T14:08:55Z)
- Patches Are All You Need? [96.88889685873106]
Vision Transformer (ViT) models may exceed the performance of convolutional networks in some settings.
ViTs require the use of patch embeddings, which group together small regions of the image into single input features.
This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation?
arXiv Detail & Related papers (2022-01-24T16:42:56Z)
- So-ViT: Mind Visual Tokens for Vision Transformer [27.243241133304785]
We propose a new classification paradigm, where the second-order, cross-covariance pooling of visual tokens is combined with class token for final classification.
We develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
The results show our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models.
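A hedged sketch of the pooling step (dimensions and the way the statistic is combined with the class token are illustrative assumptions, not So-ViT's exact module):
```python
import torch


def second_order_pool(tokens: torch.Tensor) -> torch.Tensor:
    """Covariance-style second-order statistic of visual tokens.

    tokens: (B, N, D) visual tokens (class token excluded) -> (B, D * D).
    In practice the token dimension would be reduced first to keep D * D small.
    """
    centered = tokens - tokens.mean(dim=1, keepdim=True)
    cov = torch.einsum("bnd,bne->bde", centered, centered) / (tokens.size(1) - 1)
    return cov.flatten(1)


# The pooled statistic is then combined with the class token for the final
# prediction, e.g. logits = head(torch.cat([cls_tok, proj(pooled)], dim=-1)).
```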
arXiv Detail & Related papers (2021-04-22T09:05:09Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
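The token-extraction step can be sketched as a 3D convolution whose stride equals its kernel size; the tubelet and embedding sizes below are assumptions, not ViViT's published configuration:
```python
import torch


def tubelet_tokens(video: torch.Tensor, dim: int = 768,
                   t: int = 2, p: int = 16) -> torch.Tensor:
    """Extract spatio-temporal tokens from a clip of shape (B, C, T, H, W)."""
    # In a real model the Conv3d would be a module attribute with learned weights;
    # it is constructed inline here only to keep the sketch self-contained.
    embed = torch.nn.Conv3d(video.size(1), dim, kernel_size=(t, p, p),
                            stride=(t, p, p))
    tokens = embed(video)                     # (B, dim, T/t, H/p, W/p)
    return tokens.flatten(2).transpose(1, 2)  # (B, (T/t)*(H/p)*(W/p), dim)
```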
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training [12.36664837965624]
This paper presents an approach to automatically shard the weight update across replicas.
We show this technique achieves substantial speedups on typical image and language models on Cloud TPUs.
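The idea can be simulated in a single process (an illustrative sketch with plain SGD, not the paper's XLA-level implementation): each replica applies the optimizer step to only its slice of the parameters, and the slices are then exchanged.
```python
import torch


def sharded_sgd_update(flat_params: torch.Tensor, flat_grad: torch.Tensor,
                       num_replicas: int, lr: float = 0.1) -> torch.Tensor:
    """Apply the weight update shard-by-shard, as if each replica owned one shard."""
    param_shards = flat_params.chunk(num_replicas)
    grad_shards = flat_grad.chunk(num_replicas)
    # Each "replica" updates only its own shard; in a real data-parallel job the
    # updated shards would then be all-gathered back onto every replica.
    updated = [p - lr * g for p, g in zip(param_shards, grad_shards)]
    return torch.cat(updated)
```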
arXiv Detail & Related papers (2020-04-28T07:13:50Z)