Vision Models Are More Robust And Fair When Pretrained On Uncurated
Images Without Supervision
- URL: http://arxiv.org/abs/2202.08360v2
- Date: Tue, 22 Feb 2022 18:15:21 GMT
- Title: Vision Models Are More Robust And Fair When Pretrained On Uncurated
Images Without Supervision
- Authors: Priya Goyal, Quentin Duval, Isaac Seessel, Mathilde Caron, Ishan
Misra, Levent Sagun, Armand Joulin, Piotr Bojanowski
- Abstract summary: Discriminative self-supervised learning allows training models on any random group of internet images.
We train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn.
We extensively study and validate our model performance on over 50 benchmarks including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection, and many image classification datasets.
- Score: 38.22842778742829
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Discriminative self-supervised learning allows training models on any random
group of internet images, and potentially recovering salient information that
helps differentiate between the images. Applied to ImageNet, this leads to
object-centric features that perform on par with supervised features on most
object-centric downstream tasks. In this work, we ask whether, using this
ability, we can learn any salient and more representative information present
in the diverse, unbounded set of images from across the globe. To do so, we
train models on billions of random images without any data pre-processing or
prior assumptions about what we want the model to learn. We scale our model
size to a dense 10 billion parameters to avoid underfitting on such a large
data size. We extensively study and validate our model performance on over 50
benchmarks including fairness, robustness to distribution shift, geographical
diversity, fine-grained recognition, image copy detection, and many image
classification datasets. The resulting model not only captures semantic
information well, it also captures information about artistic style and learns
salient information such as geolocations and multilingual word embeddings
based on visual content only. More importantly, we discover that such a model
is more robust, more fair, less harmful, and less biased than supervised models
or models trained on object-centric datasets such as ImageNet.
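For intuition, below is a minimal sketch of a discriminative self-supervised objective: an InfoNCE-style contrastive loss over two augmented views of the same unlabeled images. This is illustrative only, not the paper's exact training recipe; the backbone, batch of random views, and the `info_nce_loss` helper are hypothetical placeholders.

```python
# Minimal sketch of a discriminative self-supervised objective
# (InfoNCE-style contrastive loss over two augmented views of the
# same unlabeled images). Illustrative only; not the paper's method.
import torch
import torch.nn.functional as F


def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two augmentations of the same images."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                      # pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)    # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Usage: embeddings can come from any backbone; no labels are needed.
backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 128))
views1 = torch.randn(8, 3, 224, 224)   # first augmented view of a batch of images
views2 = torch.randn(8, 3, 224, 224)   # second augmented view of the same images
loss = info_nce_loss(backbone(views1), backbone(views2))
loss.backward()
```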
Related papers
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
- Regularized Training with Generated Datasets for Name-Only Transfer of Vision-Language Models [36.59260354292177]
Recent advancements in text-to-image generation have inspired researchers to generate datasets tailored for perception models using generative models.
We aim to fine-tune vision-language models to a specific classification model without access to any real images.
Despite the high fidelity of generated images, we observed a significant performance degradation when fine-tuning the model using the generated datasets.
arXiv Detail & Related papers (2024-06-08T10:43:49Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data.
We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
- A Simple and Efficient Baseline for Data Attribution on Images [107.12337511216228]
Current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions.
In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution.
Our method is model-agnostic and scales easily to large datasets.
arXiv Detail & Related papers (2023-11-03T17:29:46Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- DINOv2: Learning Robust Visual Features without Supervision [75.42921276202522]
This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources.
Most of the technical contributions aim at accelerating and stabilizing the training at scale.
In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature.
arXiv Detail & Related papers (2023-04-14T15:12:19Z)
- The effectiveness of MAE pre-pretraining for billion-scale pretraining [65.98338857597935]
We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model.
We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition.
arXiv Detail & Related papers (2023-03-23T17:56:12Z)
- On Data Scaling in Masked Image Modeling [36.00347416479826]
Masked image modeling (MIM) is suspected to be unable to benefit from larger data.
We study data scales ranging from 10% of ImageNet-1K to full ImageNet-22K, model sizes ranging from 49 million to 1 billion parameters, and training lengths ranging from 125K to 500K iterations.
We find that validation loss in pre-training is a good indicator of how well the model performs when fine-tuned on multiple tasks.
arXiv Detail & Related papers (2022-06-09T17:58:24Z)
- Multi-task pre-training of deep neural networks for digital pathology [8.74883469030132]
We first assemble and transform many digital pathology datasets into a pool of 22 classification tasks and almost 900k images.
We show that our models used as feature extractors either improve significantly over ImageNet pre-trained models or provide comparable performance.
arXiv Detail & Related papers (2020-05-05T08:50:17Z)