Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves
- URL: http://arxiv.org/abs/2303.01112v1
- Date: Thu, 2 Mar 2023 09:47:28 GMT
- Title: Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves
- Authors: Sora Takashima, Ryo Hayamizu, Nakamasa Inoue, Hirokatsu Kataoka, Rio
Yokota
- Abstract summary: Formula-driven supervised learning has been shown to be an effective method for pre-training vision transformers.
When VisualAtom-21k is used for pre-training ViT-Base, the top-1 accuracy reaches 83.7% when fine-tuning on ImageNet-1k.
Unlike JFT-300M, which is a static dataset, the quality of synthetic datasets will continue to improve.
- Score: 18.5408134000081
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Formula-driven supervised learning (FDSL) has been shown to be an effective
method for pre-training vision transformers, where ExFractalDB-21k was shown to
exceed the pre-training effect of ImageNet-21k. These studies also indicate
that contours mattered more than textures when pre-training vision
transformers. However, the lack of a systematic investigation as to why these
contour-oriented synthetic datasets can achieve the same accuracy as real
datasets leaves much room for skepticism. In the present work, we develop a
novel methodology based on circular harmonics for systematically investigating
the design space of contour-oriented synthetic datasets. This allows us to
efficiently search the optimal range of FDSL parameters and maximize the
variety of synthetic images in the dataset, which we found to be a critical
factor. When the resulting new dataset VisualAtom-21k is used for pre-training
ViT-Base, the top-1 accuracy reached 83.7% when fine-tuning on ImageNet-1k.
This is close to the top-1 accuracy (84.2%) achieved by JFT-300M pre-training,
while using only 1/14 the number of images. Unlike JFT-300M, which is a static dataset,
the quality of synthetic datasets will continue to improve, and the current
work is a testament to this possibility. FDSL is also free of the common issues
associated with real images, e.g. privacy/copyright issues, labeling
costs/errors, and ethical biases.
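The contour-oriented images the abstract describes can be illustrated with a small sketch: a closed contour whose radius is a base circle perturbed by a sum of sinusoidal (circular-harmonic) terms. The function name and parameter choices below are illustrative assumptions, not the paper's released implementation.

```python
import math

def visual_atom_contour(n_points=1024, base_radius=1.0,
                        harmonics=((2, 0.2, 0.0), (5, 0.1, 0.0))):
    """Return (x, y) points of a closed contour: a base circle whose radius
    is perturbed by a sum of sinusoidal terms (circular harmonics).

    harmonics: iterable of (frequency, amplitude, phase) tuples.
    Illustrative sketch only -- parameters are assumptions, not the
    official VisualAtom generator.
    """
    points = []
    for i in range(n_points):
        theta = 2.0 * math.pi * i / n_points
        r = base_radius
        for freq, amp, phase in harmonics:
            # each term is one circular harmonic of integer frequency `freq`
            r += amp * math.sin(freq * theta + phase)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

contour = visual_atom_contour()
```

Sampling the harmonic frequencies, amplitudes, and phases over a wide range is what would produce variety across the synthetic images, which the abstract identifies as a critical factor for pre-training effectiveness.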
Related papers
- Scaling Backwards: Minimal Synthetic Pre-training? [52.78699562832907]
We show that pre-training is effective even with minimal synthetic images.
We find that a substantial reduction of synthetic images from 1k to 1 can even lead to an increase in pre-training performance.
We extend our method from synthetic images to real images to see if a single real image can show similar pre-training effect.
arXiv Detail & Related papers (2024-08-01T16:20:02Z) - An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we question whether the fine-tuning performance of extremely simple lightweight ViTs can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M parameters) can achieve 79.4%/78.9% top-1 accuracy on ImageNet-1k.
arXiv Detail & Related papers (2024-04-18T14:14:44Z) - DataDAM: Efficient Dataset Distillation with Attention Matching [15.300968899043498]
Researchers have long tried to minimize training costs in deep learning by maintaining strong generalization across diverse datasets.
Emerging research on dataset distillation aims to reduce training costs by creating a small synthetic set that distills the information of a larger real dataset.
However, the synthetic data generated by previous methods are not guaranteed to distribute and discriminate as well as the original training data.
arXiv Detail & Related papers (2023-09-29T19:07:48Z) - Bridging the Gap: Enhancing the Utility of Synthetic Data via
Post-Processing Techniques [7.967995669387532]
Generative models have emerged as a promising solution for generating synthetic datasets that can replace or augment real-world data.
We propose three novel post-processing techniques to improve the quality and diversity of the synthetic dataset.
Experiments show that Gap Filler (GaFi) effectively reduces the gap with real-data accuracy scores to 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively.
arXiv Detail & Related papers (2023-05-17T10:50:38Z) - Minimizing the Accumulated Trajectory Error to Improve Dataset
Distillation [151.70234052015948]
We propose a novel approach that encourages the optimization algorithm to seek a flat trajectory.
We show that, with regularization towards the flat trajectory, the weights trained on synthetic data are robust against accumulated error perturbations.
Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7%.
arXiv Detail & Related papers (2022-11-20T15:49:11Z) - Replacing Labeled Real-image Datasets with Auto-generated Contours [20.234550996148748]
We show that the pre-training effect of formula-driven supervised learning can match or even exceed that of ImageNet-21k without the use of real images.
Images generated by formulas avoid the privacy/copyright issues, labeling cost and errors, and biases that real images suffer from.
arXiv Detail & Related papers (2022-06-18T06:43:38Z) - RTMV: A Ray-Traced Multi-View Synthetic Dataset for Novel View Synthesis [104.53930611219654]
We present a large-scale synthetic dataset for novel view synthesis consisting of 300k images rendered from nearly 2000 complex scenes.
The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis.
Using 4 distinct sources of high-quality 3D meshes, the scenes of our dataset exhibit challenging variations in camera views, lighting, shape, materials, and textures.
arXiv Detail & Related papers (2022-05-14T13:15:32Z) - Task2Sim : Towards Effective Pre-training and Transfer from Synthetic
Data [74.66568380558172]
We study the transferability of pre-trained models based on synthetic data generated by graphics simulators to downstream tasks.
We introduce Task2Sim, a unified model mapping downstream task representations to optimal simulation parameters.
It learns this mapping by training to find the set of best parameters on a set of "seen" tasks.
Once trained, it can then be used to predict best simulation parameters for novel "unseen" tasks in one shot.
arXiv Detail & Related papers (2021-11-30T19:25:27Z) - Self-Supervised Pre-Training for Transformer-Based Person
Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - PennSyn2Real: Training Object Recognition Models without Human Labeling [12.923677573437699]
We propose PennSyn2Real, a synthetic dataset consisting of more than 100,000 4K images of more than 20 types of micro aerial vehicles (MAVs).
The dataset can be used to generate arbitrary numbers of training images for high-level computer vision tasks such as MAV detection and classification.
We show that synthetic data generated using this framework can be directly used to train CNN models for common object recognition tasks such as detection and segmentation.
arXiv Detail & Related papers (2020-09-22T02:53:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.