DINOv2: Learning Robust Visual Features without Supervision
- URL: http://arxiv.org/abs/2304.07193v2
- Date: Fri, 2 Feb 2024 10:24:09 GMT
- Title: DINOv2: Learning Robust Visual Features without Supervision
- Authors: Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc
Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa,
Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell
Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma,
Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut,
Armand Joulin, Piotr Bojanowski
- Abstract summary: This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources.
Most of the technical contributions aim at accelerating and stabilizing the training at scale.
In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature.
- Score: 75.42921276202522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent breakthroughs in natural language processing for model pretraining
on large quantities of data have opened the way for similar foundation models
in computer vision. These models could greatly simplify the use of images in
any system by producing all-purpose visual features, i.e., features that work
across image distributions and tasks without finetuning. This work shows that
existing pretraining methods, especially self-supervised methods, can produce
such features if trained on enough curated data from diverse sources. We
revisit existing approaches and combine different techniques to scale our
pretraining in terms of data and model size. Most of the technical
contributions aim at accelerating and stabilizing the training at scale. In
terms of data, we propose an automatic pipeline to build a dedicated, diverse,
and curated image dataset instead of uncurated data, as typically done in the
self-supervised literature. In terms of models, we train a ViT model
(Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of
smaller models that surpass the best available all-purpose features, OpenCLIP
(Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
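The distillation step mentioned in the abstract can be pictured as training smaller models to match the frozen large teacher's features. The following is a minimal, hypothetical PyTorch sketch of that idea; the stub encoder, projection head, and cosine objective are illustrative assumptions, not the paper's actual recipe.

```python
# Minimal sketch of feature distillation from a large frozen teacher encoder
# into a smaller student. The exact DINOv2 losses, heads, and schedules differ;
# this only illustrates matching student features to teacher features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderStub(nn.Module):
    """Stand-in for a ViT encoder producing one global feature per image (hypothetical)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

teacher = EncoderStub(dim=1024).eval()   # frozen "large" model (placeholder for the 1B-parameter ViT)
student = EncoderStub(dim=384)           # smaller model being distilled
project = nn.Linear(384, 1024)           # aligns student features with the teacher's dimension
optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(project.parameters()), lr=1e-4
)

for step in range(10):                   # toy loop on random "images"
    images = torch.randn(8, 3, 64, 64)
    with torch.no_grad():
        target = teacher(images)         # teacher features, no gradients
    pred = project(student(images))
    loss = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```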
Related papers
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
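As a rough illustration of the training-free recipe (cluster dense features into region masks, then match each region to open-vocabulary text embeddings), here is a hypothetical sketch with placeholder arrays; the actual pipeline extracts features from a diffusion model and labels regions with CLIP.

```python
# Hypothetical sketch of training-free open-vocabulary segmentation:
# 1) cluster dense per-pixel features into candidate regions,
# 2) label each region by cosine similarity to text embeddings.
# Random arrays stand in for diffusion-model features and CLIP text features.
import numpy as np
from sklearn.cluster import KMeans

H, W, D = 32, 32, 64
pixel_feats = np.random.randn(H * W, D).astype(np.float32)             # placeholder dense features
class_names = ["cat", "dog", "background"]
text_feats = np.random.randn(len(class_names), D).astype(np.float32)   # placeholder text embeddings

# Step 1: group pixels into candidate regions.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(pixel_feats)

# Step 2: assign each region the best-matching class name.
def normalize(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

text_feats = normalize(text_feats)
segmentation = np.empty(H * W, dtype=np.int64)
for region in range(5):
    mask = labels == region
    region_feat = normalize(pixel_feats[mask].mean(axis=0, keepdims=True))
    segmentation[mask] = (region_feat @ text_feats.T).argmax()

print(segmentation.reshape(H, W))
```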
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- A Simple and Efficient Baseline for Data Attribution on Images [107.12337511216228]
Current state-of-the-art approaches require a large ensemble of as many as 300,000 models to accurately attribute model predictions.
In this work, we focus on a minimalist baseline, utilizing the feature space of a backbone pretrained via self-supervised learning to perform data attribution.
Our method is model-agnostic and scales easily to large datasets.
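A minimal sketch of the general idea, attributing a prediction to the training images closest in a pretrained self-supervised feature space, is shown below; the backbone, metric, and scoring are assumptions rather than the paper's exact method.

```python
# Hypothetical sketch of feature-space data attribution: score each training
# image by its similarity to the test image in a pretrained backbone's
# embedding space. Random vectors stand in for backbone features.
import numpy as np

rng = np.random.default_rng(0)
train_feats = rng.standard_normal((1000, 768)).astype(np.float32)  # placeholder train embeddings
test_feat = rng.standard_normal(768).astype(np.float32)            # placeholder test embedding

def cosine_scores(query, bank):
    query = query / (np.linalg.norm(query) + 1e-8)
    bank = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    return bank @ query

scores = cosine_scores(test_feat, train_feats)
top_influencers = np.argsort(-scores)[:10]   # training images most "responsible" for the prediction
print(top_influencers, scores[top_influencers])
```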
arXiv Detail & Related papers (2023-11-03T17:29:46Z)
- Delving Deeper into Data Scaling in Masked Image Modeling [145.36501330782357]
We conduct an empirical study on the scaling capability of masked image modeling (MIM) methods for visual recognition.
Specifically, we utilize the web-collected Coyo-700M dataset.
Our goal is to investigate how downstream-task performance changes as the amount of data and the model size scale.
arXiv Detail & Related papers (2023-05-24T15:33:46Z)
- The effectiveness of MAE pre-pretraining for billion-scale pretraining [65.98338857597935]
We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model.
We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition.
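The two-stage idea can be sketched as follows, assuming a toy encoder and synthetic data; this is only an illustration of "MAE-style initialization, then standard pretraining", not the paper's actual models or objectives.

```python
# Hypothetical sketch of "pre-pretraining": stage 1 initializes the encoder
# with a masked-reconstruction objective (MAE-style); stage 2 continues with
# ordinary supervised pretraining from that initialization.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(256, 128)          # placeholder vision encoder
decoder = nn.Linear(128, 256)          # reconstruction head used only in stage 1
classifier = nn.Linear(128, 10)        # head used in stage 2

# Stage 1: self-supervised MAE-style pre-pretraining (mask, then reconstruct).
opt1 = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(50):
    x = torch.randn(32, 256)
    mask = (torch.rand_like(x) > 0.75).float()   # keep roughly 25% of the input visible
    recon = decoder(encoder(x * mask))
    loss = F.mse_loss(recon, x)                  # reconstruct the full signal
    opt1.zero_grad()
    loss.backward()
    opt1.step()

# Stage 2: standard pretraining, reusing the pre-pretrained encoder weights.
opt2 = torch.optim.AdamW(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
for _ in range(50):
    x, y = torch.randn(32, 256), torch.randint(0, 10, (32,))
    loss = F.cross_entropy(classifier(encoder(x)), y)
    opt2.zero_grad()
    loss.backward()
    opt2.step()
```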
arXiv Detail & Related papers (2023-03-23T17:56:12Z)
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial in a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
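The retrieval half of that procedure can be sketched as below, using random vectors in place of CLIP-style embeddings; the subsequent optimization of the pseudo features is omitted, and all names here are illustrative assumptions.

```python
# Hypothetical sketch of a retrieval step for pseudo text features: for each
# training image, retrieve its nearest neighbors in a shared image-text
# embedding space and aggregate them into a pseudo text condition.
import numpy as np

rng = np.random.default_rng(0)
bank = rng.standard_normal((5000, 512)).astype(np.float32)    # placeholder embedding bank
image_feat = rng.standard_normal(512).astype(np.float32)      # placeholder image embedding

def l2_normalize(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

bank_n = l2_normalize(bank)
query = l2_normalize(image_feat[None])
sims = (query @ bank_n.T).ravel()
top_k = np.argsort(-sims)[:8]                                  # nearest neighbors in the shared space
pseudo_text_feat = l2_normalize(bank_n[top_k].mean(axis=0, keepdims=True))
print(pseudo_text_feat.shape)
```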
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
- KNN-Diffusion: Image Generation via Large-Scale Retrieval [40.6656651653888]
Our diffusion-based model trains on images only, leveraging a joint text-image multi-modal metric and large-scale retrieval.
Learning to adapt enables several new capabilities.
Adapting trained models to new samples can be achieved by simply adding them to the retrieval table.
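A minimal sketch of such a retrieval table follows, with random vectors as placeholders for image embeddings; the diffusion model itself and the conditioning mechanism are out of scope here.

```python
# Hypothetical sketch of a retrieval "table" used to condition generation:
# nearest-neighbor embeddings are looked up at inference time, and the model
# is adapted to new samples by appending their embeddings to the table,
# without fine-tuning any weights. Random vectors are placeholders.
import numpy as np

class RetrievalTable:
    def __init__(self, dim: int):
        self.entries = np.empty((0, dim), dtype=np.float32)

    def add(self, embeddings: np.ndarray) -> None:
        """Adapting to new samples amounts to appending their embeddings."""
        self.entries = np.vstack([self.entries, embeddings.astype(np.float32)])

    def nearest(self, query: np.ndarray, k: int = 4) -> np.ndarray:
        sims = self.entries @ query / (
            np.linalg.norm(self.entries, axis=1) * np.linalg.norm(query) + 1e-8
        )
        return self.entries[np.argsort(-sims)[:k]]

rng = np.random.default_rng(0)
table = RetrievalTable(dim=512)
table.add(rng.standard_normal((1000, 512)))          # initial image embeddings
table.add(rng.standard_normal((10, 512)))            # "adaptation": add novel samples
neighbors = table.nearest(rng.standard_normal(512))  # conditioning input for the generator
print(neighbors.shape)
```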
arXiv Detail & Related papers (2022-04-06T14:13:35Z)
- Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision [38.22842778742829]
Discriminative self-supervised learning allows training models on any random group of internet images.
We train models on billions of random images without any data pre-processing or prior assumptions about what we want the model to learn.
We extensively study and validate our model's performance on over 50 benchmarks, including fairness, robustness to distribution shift, geographical diversity, fine-grained recognition, image copy detection, and many image classification datasets.
arXiv Detail & Related papers (2022-02-16T22:26:47Z)
- Towards Efficient and Data Agnostic Image Classification Training Pipeline for Embedded Systems [0.0]
This work focuses on reviewing the latest augmentation and regularization methods for image classification.
We achieve reasonable performance on a variety of downstream image classification tasks without manually tuning parameters for each particular task.
Resulting models are computationally efficient and can be deployed to CPU using the OpenVINO toolkit.
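To make the idea of a fixed, task-agnostic augmentation and regularization recipe concrete, here is a small sketch using standard torchvision transforms; the concrete recipe, backbone, and OpenVINO export step of the paper are not reproduced, and the model here is a placeholder.

```python
# Hypothetical sketch of a task-agnostic classification training setup:
# one fixed augmentation/regularization recipe applied without per-task tuning.
import torch
import torch.nn as nn
from torchvision import transforms
from PIL import Image

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(),                   # generic augmentation policy
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),           # regularization via random erasing
])

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 100))   # placeholder classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)  # weight decay as regularizer
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)                          # label smoothing as regularizer

img = Image.new("RGB", (256, 256))              # placeholder input image
logits = model(augment(img).unsqueeze(0))
print(logits.shape)
```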
arXiv Detail & Related papers (2021-08-16T12:38:05Z)