A Closer Look at the Robustness of Contrastive Language-Image
Pre-Training (CLIP)
- URL: http://arxiv.org/abs/2402.07410v1
- Date: Mon, 12 Feb 2024 05:05:55 GMT
- Title: A Closer Look at the Robustness of Contrastive Language-Image
Pre-Training (CLIP)
- Authors: Weijie Tu, Weijian Deng, Tom Gedeon
- Abstract summary: This study investigates the safety objectives of Contrastive Language-Image Pre-training (CLIP) models.
It focuses on three key properties: resilience to visual factor variations, calibrated uncertainty estimations, and the ability to detect anomalous inputs.
We consider 10 visual factors (e.g., shape and pattern), 5 types of out-of-distribution data, and 8 natural and challenging test conditions with different shift types.
- Score: 12.5294671061385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) models have demonstrated
remarkable generalization capabilities across multiple challenging distribution
shifts. However, their robustness to variations in specific visual factors
remains largely unexplored. In real-world applications, reliable and safe
systems must consider safety objectives beyond classification accuracy, such as
predictive uncertainty. Yet, the effectiveness of CLIP models on such
safety-related properties is less explored.
Driven by the above, this work comprehensively investigates the safety
objectives of CLIP models, specifically focusing on three key properties:
resilience to visual factor variations, calibrated uncertainty estimations, and
the ability to detect anomalous inputs. To this end, we study 83 CLIP models
and 127 ImageNet classifiers. They are diverse in architecture, (pre)training
distribution and training strategies. We consider 10 visual factors (e.g.,
shape and pattern), 5 types of out-of-distribution data, and 8 natural and
challenging test conditions with different shift types, such as texture, style,
and perturbation shifts. Our study has unveiled several previously unknown
insights into CLIP models. For instance, they are not consistently more
calibrated than other ImageNet models, which contradicts existing findings.
Additionally, our analysis underscores the significance of training source
design by showcasing its profound influence on the three safety-related
properties. We believe our comprehensive study can shed light on and help guide
the development of more robust and reliable CLIP models.
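Two of the three properties above, calibration and anomaly (out-of-distribution) detection, are commonly quantified with expected calibration error (ECE) and the AUROC of a maximum-softmax-probability (MSP) score. The sketch below is a minimal, generic illustration of these two metrics, not code from the paper; it assumes NumPy arrays of softmax probabilities from any classifier (e.g., a zero-shot CLIP head), and all function and variable names are illustrative.
```python
import numpy as np
from sklearn.metrics import roc_auc_score


def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: |accuracy - confidence| gap per bin, weighted by bin size."""
    confidences = probs.max(axis=1)          # predicted-class probability
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
    return ece


def msp_ood_auroc(probs_in, probs_out):
    """AUROC of the maximum-softmax-probability score for OOD detection."""
    scores = np.concatenate([probs_in.max(axis=1), probs_out.max(axis=1)])
    is_in_distribution = np.concatenate([np.ones(len(probs_in)), np.zeros(len(probs_out))])
    return roc_auc_score(is_in_distribution, scores)
```
Under this convention, lower ECE indicates better-calibrated confidence, and an MSP AUROC closer to 1.0 indicates that in-distribution inputs receive higher confidence than anomalous ones.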
Related papers
- Toward a Holistic Evaluation of Robustness in CLIP Models [11.148206692373144]
Contrastive Language-Image Pre-training (CLIP) models have shown significant potential in zero-shot classification.
This work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives.
In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts.
arXiv Detail & Related papers (2024-10-02T13:26:17Z)
- What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights [67.72413262980272]
Severe data imbalance naturally exists among web-scale vision-language datasets.
We find that CLIP pre-trained on such data exhibits notable robustness to this imbalance compared to supervised learning.
The robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts.
arXiv Detail & Related papers (2024-05-31T17:57:24Z)
- Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study [61.65123150513683]
Multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results.
It is reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet.
We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark.
arXiv Detail & Related papers (2024-03-15T17:33:49Z)
- Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations [19.800907485589402]
Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks.
These tuned models tend to become highly specialized, limiting their practicality for real-world deployment.
We propose a lightweight representation calibration method for fine-tuning CLIP.
arXiv Detail & Related papers (2024-03-12T01:47:17Z)
- Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets [46.19529338280716]
Language models, characterized by their black-box nature, often hallucinate and display sensitivity to input perturbations.
We introduce a methodology designed to examine how input perturbations affect language models across various scales.
We present three distinct fine-tuning strategies to address robustness against multiple perturbations.
arXiv Detail & Related papers (2023-11-15T02:59:10Z)
- Uncertainty-guided Boundary Learning for Imbalanced Social Event Detection [64.4350027428928]
We propose a novel uncertainty-guided class imbalance learning framework for imbalanced social event detection tasks.
Our model significantly improves social event representation and classification in almost all classes, especially the uncertain ones.
arXiv Detail & Related papers (2023-10-30T03:32:04Z)
- A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which exposes their shortcomings.
The commonly used benchmark, SIGHAN, can not reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)
- Harnessing Perceptual Adversarial Patches for Crowd Counting [92.79051296850405]
Crowd counting is vulnerable to adversarial examples in the physical world.
This paper proposes the Perceptual Adversarial Patch (PAP) generation framework to learn the shared perceptual features between models.
arXiv Detail & Related papers (2021-09-16T13:51:39Z)
- Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications [8.15254368157658]
We analyze CLIP and highlight some of the challenges such models pose.
We find that CLIP can inherit biases found in prior computer vision systems.
These results add evidence to the growing body of work calling for a change in the notion of a 'better' model.
arXiv Detail & Related papers (2021-08-05T19:05:57Z)
- Learning perturbation sets for robust machine learning [97.6757418136662]
We use a conditional generator that defines the perturbation set over a constrained region of the latent space.
We measure the quality of our learned perturbation sets both quantitatively and qualitatively.
We leverage our learned perturbation sets to train models which are empirically and certifiably robust to adversarial image corruptions and adversarial lighting variations.
arXiv Detail & Related papers (2020-07-16T16:39:54Z)