Demystifying CLIP Data
- URL: http://arxiv.org/abs/2309.16671v4
- Date: Sun, 7 Apr 2024 17:22:46 GMT
- Title: Demystifying CLIP Data
- Authors: Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) has advanced research and applications in computer vision.
We introduce Metadata-Curated Language-Image Pre-training (MetaCLIP).
MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution.
- Score: 86.34045746910114
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient in the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and, in our pursuit of making it open to the community, introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and the training data distribution over metadata are made available at https://github.com/facebookresearch/MetaCLIP.
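The curation step the abstract describes can be pictured with a short sketch: match each caption against the metadata by substring, count matches per entry over the whole pool, and down-sample over-represented ("head") entries toward a cap t so the curated subset is balanced over the metadata distribution. This is a minimal sketch under stated assumptions; the function name `curate`, the cap value, and the keep rule are illustrative simplifications, and the released curation code in the repository above is the authoritative implementation.

```python
import random
from collections import Counter

def curate(pairs, metadata, t=20_000, seed=0):
    """Balance an image-text pool over a set of metadata entries (sketch).

    pairs    : list of (image_url, caption) tuples
    metadata : list of lowercase query strings (e.g. concepts, terms)
    t        : per-entry cap; head entries are down-sampled toward t
    """
    rng = random.Random(seed)

    # 1) Substring match: record which metadata entries appear in each caption.
    matches = []
    for _, caption in pairs:
        text = caption.lower()
        matches.append([e for e in metadata if e in text])

    # 2) Count how many captions each entry matched across the pool.
    entry_count = Counter(e for m in matches for e in m)

    # 3) Head entries (count > t) survive with probability t / count;
    #    tail entries are always kept.  A pair is retained if any one of
    #    its matched entries survives its draw; unmatched pairs are dropped.
    keep_prob = {e: min(1.0, t / c) for e, c in entry_count.items()}
    curated = []
    for pair, m in zip(pairs, matches):
        if any(rng.random() < keep_prob[e] for e in m):
            curated.append(pair)
    return curated
```

The effect is that rare concepts are preserved in full while frequent, generic concepts are sub-sampled, which is what "a balanced subset over the metadata distribution" refers to.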
Related papers
- CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning [19.100022935748225]
Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP).
Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data, and (3) designing better metrics or strategies universally applicable to any CLIP embedding.
arXiv Detail & Related papers (2024-05-29T22:19:57Z)
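To make approach (3) concrete, the sketch below shows the plain CLIP-score baseline that such metrics build on: score each pair by the cosine similarity of its CLIP image and text embeddings and keep the best-aligned fraction. The function name, the 30% keep fraction, and the assumption that embeddings are precomputed are illustrative; the paper's CLIPLoss and norm-based metrics refine this baseline rather than reduce to it.

```python
import numpy as np

def clip_score_filter(image_embs, text_embs, keep_frac=0.3):
    """Rank image-text pairs by CLIP cosine similarity and keep the top fraction.

    image_embs, text_embs : (N, D) arrays from a pretrained CLIP model
    keep_frac             : fraction of the pool to retain
    Returns indices of the retained pairs.
    """
    # L2-normalise so the per-pair dot product is the cosine similarity.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = (img * txt).sum(axis=1)            # alignment score per pair
    k = max(1, int(len(scores) * keep_frac))
    return np.argsort(-scores)[:k]              # indices of the best-aligned pairs
```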
- MoDE: CLIP Data Experts via Clustering [119.88225963429569]
We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering.
Each data expert is trained on one data cluster, being less sensitive to false negative noises in other clusters.
Experiments show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 models from OpenAI CLIP and OpenCLIP on zero-shot image classification.
arXiv Detail & Related papers (2024-04-24T17:59:24Z)
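A rough sketch of the cluster-then-ensemble idea follows, assuming captions are embedded, partitioned with k-means, and that experts are weighted at test time by how close the task's class-prompt embeddings fall to each cluster centre. The number of experts, the choice of k-means, and the softmax-style weighting are assumptions for illustration, not MoDE's exact formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_into_experts(caption_embs, n_experts=4, seed=0):
    """Cluster caption embeddings and return per-expert training index sets."""
    km = KMeans(n_clusters=n_experts, random_state=seed, n_init=10).fit(caption_embs)
    subsets = [np.where(km.labels_ == k)[0] for k in range(n_experts)]
    return km, subsets

def expert_weights(task_embs, km):
    """Weight experts by proximity of task prompt embeddings to cluster centres."""
    # (M, K) distances from each prompt to each centre, averaged over prompts,
    # then turned into normalised ensemble weights.
    d = np.linalg.norm(task_embs[:, None, :] - km.cluster_centers_[None, :, :], axis=-1)
    w = np.exp(-d.mean(axis=0))
    return w / w.sum()
```

Each expert is then trained only on its own subset, which is why it is less exposed to false-negative noise coming from the other clusters.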
- Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies [27.809995478990544]
This paper investigates the performance of Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets.
We show that a smaller dataset of high-quality data can outperform a larger dataset with lower quality.
We compare four CLIP training strategies - SLIP, FLIP, CLIP, and CLIP+Data Augmentation - and show that the choice of training strategy depends on the available compute resources.
arXiv Detail & Related papers (2024-04-12T02:04:34Z)
- A Sober Look at the Robustness of CLIPs to Spurious Features [45.87070442259975]
We create a new dataset named CounterAnimal to reveal the reliance of CLIP models on realistic spurious features.
Our evaluations show that the spurious features captured by CounterAnimal are generically learned by CLIP models with different backbones and pre-training data, yet have limited influence on ImageNet models.
arXiv Detail & Related papers (2024-03-18T06:04:02Z)
- VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- DataComp: In search of the next generation of multimodal datasets [179.79323076587255]
DataComp is a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl.
Our benchmark consists of multiple compute scales spanning four orders of magnitude.
In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet.
arXiv Detail & Related papers (2023-04-27T11:37:18Z)
- Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning [119.70303730341938]
We propose ePisode cUrriculum inveRsion (ECI) during data-free meta training and invErsion calibRation following inner loop (ICFIL) during meta testing.
ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model.
We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner.
arXiv Detail & Related papers (2023-03-20T15:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.