Evaluating CLIP: Towards Characterization of Broader Capabilities and
Downstream Implications
- URL: http://arxiv.org/abs/2108.02818v1
- Date: Thu, 5 Aug 2021 19:05:57 GMT
- Title: Evaluating CLIP: Towards Characterization of Broader Capabilities and
Downstream Implications
- Authors: Sandhini Agarwal, Gretchen Krueger, Jack Clark, Alec Radford, Jong
Wook Kim, Miles Brundage
- Abstract summary: We analyze CLIP and highlight some of the challenges such models pose.
We find that CLIP can inherit biases found in prior computer vision systems.
These results add evidence to the growing body of work calling for a change in the notion of a 'better' model.
- Score: 8.15254368157658
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there have been breakthroughs in computer vision ("CV") models that
are more generalizable with the advent of models such as CLIP and ALIGN. In
this paper, we analyze CLIP and highlight some of the challenges such models
pose. CLIP reduces the need for task-specific training data, potentially
opening up many niche tasks to automation. CLIP also allows its users to
flexibly specify image classification classes in natural language, which we
find can shift how biases manifest. Additionally, through some preliminary
probes we find that CLIP can inherit biases found in prior computer vision
systems. Given the wide and unpredictable domain of uses for such models, this
raises questions regarding what sufficiently safe behaviour for such systems
may look like. These results add evidence to the growing body of work calling
for a change in the notion of a 'better' model--to move beyond simply looking
at higher accuracy at task-oriented capability evaluations, and towards a
broader 'better' that takes into account deployment-critical features such as
different use contexts, and people who interact with the model when thinking
about model deployment.
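The abstract notes that CLIP lets users specify classification classes in natural language. The sketch below illustrates why the choice of class set matters: the same image can receive a different label when one more text prompt is added. It is a minimal illustration, not the paper's method or code. The toy three-dimensional vectors stand in for encoder outputs, and the prompts, the function name `zero_shot_classify`, and the temperature value are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, class_embeddings, temperature=0.01):
    """Return {prompt: probability} from a softmax over cosine similarities.

    `temperature` is an assumed value chosen for illustration.
    """
    labels = list(class_embeddings)
    logits = [cosine(image_embedding, class_embeddings[label]) / temperature
              for label in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Hypothetical toy embeddings; a real system would obtain these from
# an image encoder and a text encoder.
image = [0.9, 0.1, 0.4]
two_classes = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
}
probs_two = zero_shot_classify(image, two_classes)
best_two = max(probs_two, key=probs_two.get)

# Adding one more user-defined class changes which label wins.
three_classes = dict(two_classes)
three_classes["a photo of an animal"] = [0.8, 0.1, 0.5]
probs_three = zero_shot_classify(image, three_classes)
best_three = max(probs_three, key=probs_three.get)

print(best_two)
print(best_three)
```

Because the label set is defined by the user at inference time, an evaluation run with one set of prompts says little about behaviour under another, which is one reason class design deserves attention alongside accuracy.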
Related papers
- Toward a Holistic Evaluation of Robustness in CLIP Models [11.148206692373144]
Contrastive Language-Image Pre-training (CLIP) models have shown significant potential in zero-shot classification.
This work aims to provide a more comprehensive assessment of CLIP by introducing several new perspectives.
In each aspect, we consider the impact of six factors on CLIP models: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts.
arXiv Detail & Related papers (2024-10-02T13:26:17Z)
- CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning [17.614980614656407]
We propose Continual Generative training for Incremental prompt-Learning.
We exploit Variational Autoencoders to learn class-conditioned distributions.
We show that such a generative replay approach can adapt to new tasks while improving zero-shot capabilities.
arXiv Detail & Related papers (2024-07-22T16:51:28Z)
- Multimodal CLIP Inference for Meta-Few-Shot Image Classification [0.0]
Multimodal foundation models like CLIP learn a joint (image, text) embedding.
This study demonstrates that combining modalities from CLIP's text and image encoders outperforms state-of-the-art meta-few-shot learners on widely adopted benchmarks.
arXiv Detail & Related papers (2024-03-26T17:47:54Z)
- Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations [19.800907485589402]
Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse downstream tasks.
These tuned models tend to become highly specialized, limiting their practicality for real-world deployment.
We propose a lightweight representation calibration method for fine-tuning CLIP.
arXiv Detail & Related papers (2024-03-12T01:47:17Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Incremental Object Detection with CLIP [36.478530086163744]
We propose using a vision-language model such as CLIP to generate text feature embeddings for different class sets.
We then employ super-classes to replace the unavailable novel classes in the early learning stage to simulate the incremental scenario.
We incorporate the finely recognized detection boxes as pseudo-annotations into the training process, thereby further improving the detection performance.
arXiv Detail & Related papers (2023-10-13T01:59:39Z)
- Continual Learners are Incremental Model Generalizers [70.34479702177988]
This paper extensively studies the impact of Continual Learning (CL) models as pre-trainers.
We find that the transfer quality of the representation often increases gradually without noticeable degradation in fine-tuning performance.
We propose a new fine-tuning scheme, GLobal Attention Discretization (GLAD), that preserves rich task-generic representations while solving downstream tasks.
arXiv Detail & Related papers (2023-06-21T05:26:28Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but further adapting CLIP to downstream tasks undesirably degrades OOD performance.
We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data.
Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
- Self-Supervised Models are Continual Learners [79.70541692930108]
We show that self-supervised loss functions can be seamlessly converted into distillation mechanisms for Continual Learning.
We devise a framework for Continual self-supervised visual representation Learning that significantly improves the quality of the learned representations.
arXiv Detail & Related papers (2021-12-08T10:39:13Z)
- Plausible Counterfactuals: Auditing Deep Learning Classifiers with Realistic Adversarial Examples [84.8370546614042]
The black-box nature of Deep Learning models has posed unanswered questions about what they learn from data.
A Generative Adversarial Network (GAN) and multiple objectives are used to furnish a plausible attack on the audited model.
Its utility is showcased within a human face classification task, unveiling the enormous potential of the proposed framework.
arXiv Detail & Related papers (2020-03-25T11:08:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.