Pre-training Vision Transformers with Very Limited Synthesized Images
- URL: http://arxiv.org/abs/2307.14710v2
- Date: Mon, 31 Jul 2023 01:06:05 GMT
- Title: Pre-training Vision Transformers with Very Limited Synthesized Images
- Authors: Ryo Nakamura, Hirokatsu Kataoka, Sora Takashima, Edgar Josafat
Martinez Noriega, Rio Yokota and Nakamasa Inoue
- Abstract summary: Formula-driven supervised learning (FDSL) is a pre-training method that relies on synthetic images generated from mathematical formulae such as fractals.
Prior work on FDSL has shown that pre-training vision transformers on such synthetic datasets can yield competitive accuracy on a wide range of downstream tasks.
- Score: 18.627567043226172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Formula-driven supervised learning (FDSL) is a pre-training method that
relies on synthetic images generated from mathematical formulae such as
fractals. Prior work on FDSL has shown that pre-training vision transformers on
such synthetic datasets can yield competitive accuracy on a wide range of
downstream tasks. These synthetic images are categorized according to the
parameters in the mathematical formula that generate them. In the present work,
we hypothesize that the process for generating different instances for the same
category in FDSL can be viewed as a form of data augmentation. We validate
this hypothesis by replacing the instances with data augmentation, which means
we only need a single image per category. Our experiments show that this
one-instance fractal database (OFDB) performs better than the original dataset
where instances were explicitly generated. We further scale up OFDB to 21,000
categories and show that it matches, or even surpasses, the model pre-trained
on ImageNet-21k in ImageNet-1k fine-tuning. The number of images in OFDB is
21k, whereas ImageNet-21k has 14M. This opens new possibilities for
pre-training vision transformers with much smaller datasets.
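To make the idea concrete, the sketch below (a rough illustration, not the authors' pipeline) treats one set of iterated function system (IFS) parameters as one category, renders a single fractal image for it, and lets standard data augmentation supply the instance-level variety, in the spirit of OFDB. The parameter ranges, the chaos-game renderer, and the torchvision augmentations are illustrative assumptions.

```python
# Minimal sketch of the FDSL / OFDB idea (NOT the authors' code): one set of
# IFS parameters defines one category, a single fractal image is rendered per
# category, and training variety comes from data augmentation instead of
# rendering many instances per category.
import numpy as np
from PIL import Image
import torchvision.transforms as T

def sample_ifs_params(n_maps=4, rng=None):
    """Sample n_maps affine maps (a, b, c, d, e, f); one sample = one category."""
    rng = rng or np.random.default_rng()
    return rng.uniform(-1.0, 1.0, size=(n_maps, 6))

def render_fractal(params, size=224, n_points=50_000, rng=None):
    """Render one fractal image from the IFS parameters with the chaos game."""
    rng = rng or np.random.default_rng()
    canvas = np.zeros((size, size), dtype=np.uint8)
    x, y = 0.0, 0.0
    for _ in range(n_points):
        a, b, c, d, e, f = params[rng.integers(len(params))]
        x, y = a * x + b * y + e, c * x + d * y + f
        if not (np.isfinite(x) and np.isfinite(y)):   # restart if the orbit blows up
            x, y = 0.0, 0.0
            continue
        px = int(np.clip((x + 2.0) / 4.0 * (size - 1), 0, size - 1))
        py = int(np.clip((y + 2.0) / 4.0 * (size - 1), 0, size - 1))
        canvas[py, px] = 255
    return Image.fromarray(canvas).convert("RGB")

# One image per category (OFDB-style); augmentation stands in for extra instances.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.3, 1.0)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

rng = np.random.default_rng(0)
category_images = [render_fractal(sample_ifs_params(rng=rng), rng=rng)
                   for _ in range(10)]                # e.g. 10 categories, 1 image each
views = [augment(img) for img in category_images]     # fresh random views every epoch
```

Published FDSL datasets typically also filter the sampled parameters (e.g., by how much of the canvas the rendered fractal fills) and render more carefully; the sketch above only conveys the one-image-per-category idea.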
Related papers
- Scaling Backwards: Minimal Synthetic Pre-training? [52.78699562832907]
We show that pre-training is effective even with minimal synthetic images.
We find that a substantial reduction of synthetic images from 1k to 1 can even lead to an increase in pre-training performance.
We extend our method from synthetic images to real images to see if a single real image can show a similar pre-training effect.
arXiv Detail & Related papers (2024-08-01T16:20:02Z) - SynCDR : Training Cross Domain Retrieval Models with Synthetic Data [69.26882668598587]
In cross-domain retrieval, a model is required to identify images from the same semantic category across two visual domains.
We show how to generate synthetic data to fill in these missing category examples across domains.
Our best SynCDR model can outperform prior art by up to 15%.
arXiv Detail & Related papers (2023-12-31T08:06:53Z) - Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves [18.5408134000081]
Formula-driven supervised learning has been shown to be an effective method for pre-training transformers.
When VisualAtom-21k is used for pre-training ViT-Base, the top-1 accuracy reaches 83.7% when fine-tuning on ImageNet-1k.
Unlike JFT-300M which is a static dataset, the quality of synthetic datasets will continue to improve.
arXiv Detail & Related papers (2023-03-02T09:47:28Z) - Replacing Labeled Real-image Datasets with Auto-generated Contours [20.234550996148748]
We show that formula-driven supervised learning can match or even exceed the pre-training effect of ImageNet-21k without the use of real images.
Images generated by formulas avoid the privacy/copyright issues, labeling cost and errors, and biases that real images suffer from.
arXiv Detail & Related papers (2022-06-18T06:43:38Z) - Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
arXiv Detail & Related papers (2022-02-07T17:59:04Z) - Feature transforms for image data augmentation [74.12025519234153]
In image classification, many augmentation approaches utilize simple image manipulation algorithms.
In this work, we build ensembles on the data level by adding images generated by combining fourteen augmentation approaches.
Pretrained ResNet50 networks are finetuned on training sets that include images derived from each augmentation method.
arXiv Detail & Related papers (2022-01-24T14:12:29Z) - How to train your ViT? Data, Augmentation, and Regularization in Vision
Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z) - Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with
56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than their CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.