Do CLIPs Always Generalize Better than ImageNet Models?
- URL: http://arxiv.org/abs/2403.11497v1
- Date: Mon, 18 Mar 2024 06:04:02 GMT
- Title: Do CLIPs Always Generalize Better than ImageNet Models?
- Authors: Qizhou Wang, Yong Lin, Yongqiang Chen, Ludwig Schmidt, Bo Han, Tong Zhang
- Abstract summary: Large vision language models, such as CLIPs, have revolutionized modern machine learning.
We show that CLIPs trained on either LAION or the OpenAI data exhibit notable performance drops on the counter group of our new CounterAnimal dataset.
Surprisingly, we observe that single-modal models trained on ImageNet are more robust than CLIPs.
- Score: 45.87070442259975
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large vision language models, such as CLIPs, have revolutionized modern machine learning. CLIPs have demonstrated great generalizability under distribution shifts, supported by an increasing body of literature. However, the evaluation datasets for CLIPs are mainly variations designed for ImageNet benchmarks, which may not fully reflect the extent to which CLIPs, e.g., those pre-trained on LAION, are robust to spurious correlations. To bridge the gap, we collect a real-world dataset called CounterAnimal that contains realistic spurious features found in animal photos. CounterAnimal consists of a) the common group, comprising animals on common backgrounds, and b) the counter group, including animals on unusual backgrounds. The performance drop from the common to the counter group quantifies a model's reliance on spurious features (i.e., backgrounds) to predict the animals. We find that CLIPs trained on either LAION or the OpenAI data exhibit notable performance drops on the counter group. Surprisingly, we observe that single-modal models trained on ImageNet are more robust than CLIPs. We provide both theoretical and empirical explanations for why CLIPs still learn spurious features. Our findings suggest that distribution shifts remain an open problem for CLIPs, and one needs to be cautious about test setups when evaluating foundation models pre-trained on a significantly different scale and distribution.
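The common-versus-counter comparison described above reduces to measuring zero-shot accuracy on two splits that share the same animal classes and differ only in background. Below is a minimal sketch of that measurement, assuming an open_clip LAION checkpoint and CounterAnimal laid out as two ImageFolder-style directories; the paths, prompts, and checkpoint are illustrative assumptions, not the paper's released code.

```python
import torch
import open_clip
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumed checkpoint: a LAION-pretrained ViT-B/32 from open_clip
# (not necessarily the exact model evaluated in the paper).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.eval().to(device)

def zero_shot_accuracy(image_dir: str) -> float:
    """Zero-shot classify animals; ideally the background should not change the prediction."""
    dataset = ImageFolder(image_dir, transform=preprocess)
    prompts = [f"a photo of a {name}" for name in dataset.classes]
    correct, total = 0, 0
    with torch.no_grad():
        text_feat = model.encode_text(tokenizer(prompts).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        for images, labels in DataLoader(dataset, batch_size=64):
            img_feat = model.encode_image(images.to(device))
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            preds = (img_feat @ text_feat.T).argmax(dim=-1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Hypothetical directory layout: one subfolder per animal class in each group.
acc_common = zero_shot_accuracy("CounterAnimal/common")    # animals on typical backgrounds
acc_counter = zero_shot_accuracy("CounterAnimal/counter")  # same classes, unusual backgrounds
print(f"common {acc_common:.1%} | counter {acc_counter:.1%} | drop {acc_common - acc_counter:.1%}")
```

The same drop can be computed for an ImageNet-trained classifier by replacing the text-prompt similarities with the model's own class logits, which is the comparison behind the single-modal result quoted above.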
Related papers
- Identity Inference from CLIP Models using Only Textual Data [12.497110441765274]
Existing methods for identity inference in CLIP models require querying the model with full PII.
Traditional membership inference attacks (MIAs) train shadow models to mimic the behaviors of the target model.
We propose TUNI (textual unimodal detector), a novel method for ID inference in CLIP models that queries the target model with only text data.
arXiv Detail & Related papers (2024-05-23T12:54:25Z)
- Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP [3.5999252362400993]
We study whether vision-language models can successfully classify images with novel compositions of attribute-object pairs.
We found that CLIP models trained on large datasets, such as OpenAI's CLIP data, LAION-400M, and LAION-2B, show orders-of-magnitude improvements in effective compositional OoD generalization.
Our results provide evidence that the scale and diversity of training data and language supervision play a key role in unlocking the compositional generalization abilities of vision-language models.
arXiv Detail & Related papers (2024-03-27T12:59:44Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Demystifying CLIP Data [86.34045746910114]
Contrastive Language-Image Pre-training (CLIP) has advanced research and applications in computer vision.
We introduce Metadata-Curated Language-Image Pre-training (MetaCLIP).
MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution.
arXiv Detail & Related papers (2023-09-28T17:59:56Z)
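The curation idea summarized in the MetaCLIP entry above, balancing a raw pool over metadata entries, can be illustrated with a small sketch: match captions against the metadata by substring and cap how many pairs any single entry may contribute, so head concepts stop dominating. The matching rule and cap value below are illustrative assumptions, not the official MetaCLIP recipe.

```python
import random
from collections import defaultdict

def balance_pool(pairs, metadata_entries, cap_per_entry=20_000, seed=0):
    """pairs: list of (image_url, caption) tuples; metadata_entries: list of query strings."""
    rng = random.Random(seed)
    matches = defaultdict(list)
    for pair in pairs:
        caption = pair[1].lower()
        for entry in metadata_entries:
            if entry in caption:              # naive substring matching (assumed)
                matches[entry].append(pair)
    curated = set()                           # a set deduplicates pairs matched by several entries
    for entry, matched in matches.items():
        if len(matched) > cap_per_entry:      # head entries: subsample down to the cap
            matched = rng.sample(matched, cap_per_entry)
        curated.update(matched)               # tail entries: keep everything
    return list(curated)
```

Capping head entries while keeping tail entries intact is what flattens the metadata distribution of the resulting subset.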
- Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP [57.53087077735303]
We introduce SDS-CLIP, a lightweight and sample-efficient distillation method to enhance CLIP's compositional visio-linguistic reasoning.
Our approach fine-tunes CLIP using a distillation objective borrowed from large text-to-image generative models like Stable-Diffusion.
On the challenging Winoground benchmark, SDS-CLIP improves the visio-linguistic performance of various CLIP models by up to 7%, while on the ARO dataset, it boosts performance by up to 3%.
arXiv Detail & Related papers (2023-07-18T13:10:11Z)
- Face Recognition in the age of CLIP & Billion image datasets [0.0]
We evaluate the performance of various CLIP models as zero-shot face recognizers.
We also investigate the robustness of CLIP models against data poisoning attacks.
arXiv Detail & Related papers (2023-01-18T05:34:57Z)
- Learning Customized Visual Models with Retrieval-Augmented Knowledge [104.05456849611895]
We propose REACT, a framework to acquire the relevant web knowledge to build customized visual models for target domains.
We retrieve the most relevant image-text pairs from a web-scale database as external knowledge, and propose to customize the model by training only new modularized blocks while freezing all the original weights.
The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero-, few-, and full-shot settings.
arXiv Detail & Related papers (2023-01-17T18:59:06Z)
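The freeze-and-extend recipe summarized in the REACT entry above, training only new blocks while all original weights stay fixed, can be sketched generically in PyTorch. The gated residual adapters and the clip_model.visual.transformer.resblocks layout below are illustrative assumptions, not the paper's actual modularized blocks.

```python
import torch
import torch.nn as nn

class GatedAdapter(nn.Module):
    """Small residual MLP; the zero-initialized gate preserves the frozen model's behavior at the start."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, hidden), nn.Linear(hidden, dim)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.tanh(self.gate) * self.up(torch.relu(self.down(x)))

def add_trainable_blocks(clip_model: nn.Module, dim: int):
    # Freeze every original weight first ...
    for p in clip_model.parameters():
        p.requires_grad_(False)
    # ... then attach one trainable adapter after each visual transformer block (assumed layout);
    # move the adapters to the model's device/dtype before training.
    trainable = []
    for block in clip_model.visual.transformer.resblocks:
        block.adapter = GatedAdapter(dim)
        block.register_forward_hook(lambda mod, inp, out: mod.adapter(out))
        trainable += list(block.adapter.parameters())
    return trainable  # hand only these to the optimizer; the backbone stays untouched
```

Only the returned adapter parameters are optimized, so retrieved external knowledge can customize the model without perturbing the pretrained weights.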
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Personalizing Pre-trained Models [23.145974171912414]
We consider how upstream pretrained models can be leveraged for downstream few-shot, multilabel, and continual learning tasks.
Our model CLIPPER (CLIP PERsonalized) uses image representations from CLIP, a large-scale image representation learning model trained using weak natural language supervision.
arXiv Detail & Related papers (2021-06-02T22:58:47Z)
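The starting point of CLIPPER above, reusing frozen CLIP image representations for downstream few-shot tasks, follows the familiar feature-extraction recipe. Below is a minimal sketch, assuming open_clip features and a scikit-learn logistic-regression probe; both are illustrative choices, not the paper's implementation.

```python
import torch
import open_clip
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumed backbone: the OpenAI-pretrained ViT-B/32 weights exposed through open_clip.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model = model.eval().to(device)

@torch.no_grad()
def embed(images: torch.Tensor):
    """images: a batch of preprocessed image tensors -> L2-normalized CLIP features (numpy)."""
    feats = model.encode_image(images.to(device))
    return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()

def few_shot_predict(support_images, support_labels, query_images):
    """Fit a lightweight probe on a handful of labelled examples, then label the queries."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(embed(support_images), support_labels)
    return probe.predict(embed(query_images))
```

Because the CLIP backbone never updates, the same cached embeddings could also serve multilabel or continual settings by swapping in a different probe.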