Where Should I Spend My FLOPS? Efficiency Evaluations of Visual
Pre-training Methods
- URL: http://arxiv.org/abs/2209.15589v2
- Date: Mon, 3 Oct 2022 17:02:05 GMT
- Title: Where Should I Spend My FLOPS? Efficiency Evaluations of Visual
Pre-training Methods
- Authors: Skanda Koppula, Yazhe Li, Evan Shelhamer, Andrew Jaegle, Nikhil Parthasarathy, Relja Arandjelovic, João Carreira, Olivier Hénaff
- Abstract summary: Given a fixed FLOP budget, what are the best datasets, models, and (self-)supervised training methods for obtaining high accuracy on representative visual tasks?
We examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K, and COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked Autoencoding, and supervised).
Our results call into question the commonly-held assumption that self-supervised methods inherently scale to large, uncurated data.
- Score: 29.141145775835106
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Self-supervised methods have achieved remarkable success in transfer
learning, often achieving the same or better accuracy than supervised
pre-training. Most prior work has done so by increasing pre-training
computation by adding complex data augmentation, multiple views, or lengthy
training schedules. In this work, we investigate a related, but orthogonal
question: given a fixed FLOP budget, what are the best datasets, models, and
(self-)supervised training methods for obtaining high accuracy on
representative visual tasks? Given the availability of large datasets, this
setting is often more relevant for academic and industry labs alike. We
examine five large-scale datasets (JFT-300M, ALIGN, ImageNet-1K, ImageNet-21K,
and COCO) and six pre-training methods (CLIP, DINO, SimCLR, BYOL, Masked
Autoencoding, and supervised). In a like-for-like fashion, we characterize
their FLOP and CO2 footprints, relative to their accuracy when transferred
to a canonical image segmentation task. Our analysis reveals strong disparities
in the computational efficiency of pre-training methods and their dependence on
dataset quality. In particular, our results call into question the
commonly-held assumption that self-supervised methods inherently scale to
large, uncurated data. We therefore advocate for (1) paying closer attention to
dataset curation and (2) reporting accuracies in the context of the total
computational cost.
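To make the kind of accounting advocated above concrete, here is a minimal sketch of how one might report transfer accuracy alongside total pre-training FLOPs and a coarse CO2 estimate. The method names echo those studied in the paper, but every number (FLOPs per example, examples seen, accuracies, energy and carbon constants) is an invented placeholder, not a measurement from the paper.

```python
# Minimal sketch of FLOP/CO2 accounting for pre-training runs.
# All numbers below are placeholders for illustration, not values from the paper.
from dataclasses import dataclass

@dataclass
class PretrainingRun:
    method: str
    flops_per_example: float   # forward+backward FLOPs for one training example
    examples_seen: int         # total examples processed during pre-training
    transfer_accuracy: float   # e.g. mIoU after transfer to segmentation, in [0, 1]

    @property
    def total_flops(self) -> float:
        return self.flops_per_example * self.examples_seen

    def co2_kg(self, flops_per_joule: float = 1e11, kg_co2_per_kwh: float = 0.4) -> float:
        """Very coarse CO2 estimate: FLOPs -> energy -> emissions (placeholder constants)."""
        joules = self.total_flops / flops_per_joule
        kwh = joules / 3.6e6
        return kwh * kg_co2_per_kwh

# Hypothetical runs under comparison (method names from the paper, numbers invented).
runs = [
    PretrainingRun("SimCLR",     flops_per_example=3.5e10, examples_seen=1_000_000_000, transfer_accuracy=0.72),
    PretrainingRun("MAE",        flops_per_example=1.2e10, examples_seen=1_000_000_000, transfer_accuracy=0.74),
    PretrainingRun("Supervised", flops_per_example=1.0e10, examples_seen=300_000_000,   transfer_accuracy=0.75),
]

# Report accuracy in the context of total compute, as the paper advocates.
for r in sorted(runs, key=lambda r: r.total_flops):
    print(f"{r.method:>10}: {r.total_flops:.2e} FLOPs, "
          f"~{r.co2_kg():.1f} kg CO2, accuracy {r.transfer_accuracy:.2f}, "
          f"accuracy per exaFLOP {r.transfer_accuracy / (r.total_flops / 1e18):.3f}")
```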
Related papers
- Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding [9.112203072394648]
Power-law scaling indicates that large-scale training with uniform sampling is prohibitively slow.
Active learning methods aim to increase data efficiency by prioritizing learning on the most relevant examples.
arXiv Detail & Related papers (2023-12-08T19:26:13Z)
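The active-learning idea above is only stated at a high level; the sketch below shows one generic way to prioritize "relevant" examples, scoring a large candidate batch with the current model and training on the highest-loss subset. It is not the specific method of the cited paper.

```python
# Generic loss-based data prioritization sketch (illustrative only; not the
# cited paper's method, which is described above only at a high level).
import torch
import torch.nn.functional as F

def prioritized_step(model, optimizer, candidate_x, candidate_y, keep_fraction=0.25):
    """Score a large candidate batch with the current model, then train on the
    highest-loss (most informative) subset instead of a uniform sample."""
    model.eval()
    with torch.no_grad():
        per_example_loss = F.cross_entropy(model(candidate_x), candidate_y, reduction="none")
    k = max(1, int(keep_fraction * len(candidate_x)))
    idx = per_example_loss.topk(k).indices   # most "relevant" examples under this proxy

    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(candidate_x[idx]), candidate_y[idx])
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny synthetic usage example.
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(128, 16), torch.randint(0, 4, (128,))
print(prioritized_step(model, optimizer, x, y))
```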
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Large-scale Dataset Pruning with Dynamic Uncertainty [28.60845105174658]
The state of the art of many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them.
In this paper, we investigate how to prune the large-scale datasets, and thus produce an informative subset for training sophisticated deep models with negligible performance drop.
arXiv Detail & Related papers (2023-06-08T13:14:35Z)
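As a rough illustration of uncertainty-driven pruning in the entry above, the sketch below scores each example by the variance of its predictions across training checkpoints and keeps the highest-scoring fraction. The scoring rule is a plausible stand-in, not necessarily the paper's exact definition of dynamic uncertainty.

```python
# Sketch of uncertainty-based dataset pruning. The variance-over-checkpoints score
# is a plausible stand-in for "dynamic uncertainty", not the paper's exact definition.
import numpy as np

def prune_dataset(prob_history: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """prob_history: shape (num_checkpoints, num_examples), each example's predicted
    probability for its true class at successive training checkpoints.
    Returns the indices of the examples to keep."""
    # Examples whose predictions fluctuate across checkpoints are treated as informative;
    # consistently easy examples contribute less and are pruned.
    uncertainty = prob_history.std(axis=0)
    num_keep = int(keep_ratio * prob_history.shape[1])
    return np.argsort(-uncertainty)[:num_keep]

# Usage with fake prediction histories for 10 examples over 5 checkpoints.
rng = np.random.default_rng(0)
history = rng.uniform(size=(5, 10))
print(prune_dataset(history, keep_ratio=0.5))
```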
- Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
- Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably to supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
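A minimal sketch of distillation-style pre-training, in the spirit of the entry above: a student backbone is trained to regress a frozen teacher's features through a small projection head. The architectures and plain L2 objective are illustrative assumptions, not the cited paper's exact setup.

```python
# Feature-mimicking distillation sketch: pre-train a student by regressing a frozen
# teacher's representations. Architecture and loss are illustrative choices only.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))  # stands in for a pre-trained model
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
projector = nn.Linear(128, 256)   # maps student features into the teacher's feature space
teacher.requires_grad_(False)

optimizer = torch.optim.AdamW(list(student.parameters()) + list(projector.parameters()), lr=1e-3)

def distill_step(images):
    with torch.no_grad():
        target = teacher(images)
    pred = projector(student(images))
    loss = nn.functional.mse_loss(pred, target)   # student mimics teacher features
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(distill_step(torch.randn(8, 3, 32, 32)))
```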
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, a larger pre-training dataset is usually needed to boost performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- Efficient Conditional Pre-training for Transfer Learning [71.01129334495553]
We propose efficient filtering methods to select relevant subsets from the pre-training dataset.
We validate our techniques by pre-training on ImageNet in both the unsupervised and supervised settings.
We improve standard ImageNet pre-training by 1-3% by tuning available models on our subsets and by pre-training on a dataset filtered from a larger-scale dataset.
arXiv Detail & Related papers (2020-11-20T06:16:15Z)
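One simple way to realize the filtering described above is to embed both datasets with a fixed encoder and keep the pre-training images most similar to the target distribution. The sketch below is a generic cosine-similarity filter under that assumption, not the cited paper's specific method.

```python
# Generic feature-similarity filter for selecting a pre-training subset relevant to a
# target task. Illustrative only; the cited paper's filtering may differ.
import numpy as np

def filter_pretraining_set(pretrain_feats: np.ndarray,
                           target_feats: np.ndarray,
                           keep_ratio: float = 0.1) -> np.ndarray:
    """Both inputs are (num_images, feature_dim) embeddings from any fixed encoder.
    Keeps the pre-training images most similar to the mean target embedding."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    target_centroid = normalize(normalize(target_feats).mean(axis=0, keepdims=True))
    similarity = normalize(pretrain_feats) @ target_centroid.T   # cosine similarity
    num_keep = int(keep_ratio * len(pretrain_feats))
    return np.argsort(-similarity[:, 0])[:num_keep]

# Usage with random stand-in embeddings.
rng = np.random.default_rng(1)
subset = filter_pretraining_set(rng.normal(size=(1000, 64)), rng.normal(size=(50, 64)))
print(subset.shape)  # indices of the selected pre-training images
```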
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep training on the enlarged dataset efficient, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences arising from its use.