Investigating the Limitation of CLIP Models: The Worst-Performing
Categories
- URL: http://arxiv.org/abs/2310.03324v1
- Date: Thu, 5 Oct 2023 05:37:33 GMT
- Title: Investigating the Limitation of CLIP Models: The Worst-Performing
Categories
- Authors: Jie-Jing Shao, Jiang-Xin Shi, Xiao-Wen Yang, Lan-Zhe Guo, Yu-Feng Li
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts.
It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts.
However, we found that CLIP's performance in the worst categories is significantly inferior to its overall performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) provides a foundation model by
integrating natural language into visual concepts, enabling zero-shot
recognition on downstream tasks. It is usually expected that satisfactory
overall accuracy can be achieved across numerous domains through well-designed
textual prompts. However, we found that CLIP's performance in the worst
categories is significantly inferior to the overall performance. For example,
on ImageNet, there are a total of 10 categories with class-wise accuracy as low
as 0%, even though the overall accuracy reaches 64.1%. This
phenomenon reveals the potential risks associated with using CLIP models,
particularly in risk-sensitive applications where specific categories hold
significant importance. To address this issue, we investigate the alignment
between the two modalities in the CLIP model and propose the Class-wise
Matching Margin (CMM) to measure inference confusion. CMM can
effectively identify the worst-performing categories and estimate the potential
performance of the candidate prompts. We further query large language models to
enrich descriptions of worst-performing categories and build a weighted
ensemble to highlight the efficient prompts. Experimental results clearly
verify the effectiveness of our proposal, where the accuracy on the worst-10
categories on ImageNet is boosted to 5.2%, without manual prompt engineering,
laborious optimization, or access to labeled validation data.
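The pipeline described in the abstract can be sketched in a few lines. The following is a minimal sketch rather than the authors' released code: it assumes the open-source `clip` package (github.com/openai/CLIP), reads the Class-wise Matching Margin as the average gap between a class's matching score and its strongest competitor over the unlabeled images assigned to that class, and takes the ensemble weights as given; the paper's exact definitions may differ.

```python
# Hedged sketch of CMM-style diagnosis and weighted prompt ensembling.
# Assumptions: the open-source `clip` package (github.com/openai/CLIP);
# this margin is ONE reading of the Class-wise Matching Margin, and the
# weighting scheme is a placeholder, not the paper's exact formulation.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_prompts(classnames, template="a photo of a {}."):
    """One text embedding per class, L2-normalized: shape (C, d)."""
    with torch.no_grad():
        tokens = clip.tokenize([template.format(c) for c in classnames]).to(device)
        feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

def class_wise_matching_margin(image_feats, text_feats):
    """For each class c: mean(top-1 score minus runner-up score) over the
    unlabeled images predicted as c. image_feats must be L2-normalized
    (e.g. from model.encode_image on preprocessed images). Small or NaN
    (never-predicted) values flag likely worst-performing categories,
    with no labeled validation data required."""
    sims = image_feats @ text_feats.T                  # (N, C) cosine similarities
    top2 = sims.topk(2, dim=-1)                        # best and runner-up per image
    pred = top2.indices[:, 0]
    margin = top2.values[:, 0] - top2.values[:, 1]
    cmm = torch.full((text_feats.shape[0],), float("nan"), device=sims.device)
    for c in range(text_feats.shape[0]):
        mask = pred == c
        if mask.any():
            cmm[c] = margin[mask].mean()
    return cmm

def weighted_prompt_ensemble(per_prompt_text_feats, weights):
    """Blend several candidate prompt sets, each of shape (C, d), with
    scalar weights (e.g. derived from each set's CMM), then re-normalize."""
    stacked = torch.stack(per_prompt_text_feats)       # (P, C, d)
    w = torch.tensor(weights, device=stacked.device, dtype=stacked.dtype)
    blended = (w.view(-1, 1, 1) * stacked).sum(dim=0)
    return blended / blended.norm(dim=-1, keepdim=True)

# Example flow (classnames and image_feats are placeholders):
# text_feats = encode_prompts(classnames)
# cmm = class_wise_matching_margin(image_feats, text_feats)
# worst = cmm.nan_to_num(-1.0).argsort()[:10]          # candidate worst-10 classes
```

Classes flagged by a low (or undefined) margin would then receive enriched descriptions (the paper queries a large language model for these); each candidate prompt set is re-scored the same way, and the strongest candidates are blended via the weighted ensemble.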
Related papers
- On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
- Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models [0.0]
This paper introduces CascadeVLM, an innovative framework that overcomes the constraints of previous CLIP-based methods.
Experiments across various fine-grained image datasets demonstrate that CascadeVLM significantly outperforms existing models.
arXiv Detail & Related papers (2024-05-18T14:12:04Z)
- Dual-Modal Prompting for Sketch-Based Image Retrieval [76.12076969949062]
We propose a dual-modal CLIP (DP-CLIP) network with an adaptive prompting strategy.
We employ a set of images from the target category and the textual category label to construct category-adaptive prompt tokens and channel scales, respectively.
Our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot method by 7.3% in Acc.@1 on the Sketchy dataset.
arXiv Detail & Related papers (2024-04-29T13:43:49Z)
- Transductive Zero-Shot and Few-Shot CLIP [24.592841797020203]
This paper addresses the transductive zero-shot and few-shot CLIP classification challenge.
Inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently; a generic sketch of this idea appears after this list.
Our approach yields a nearly 20% improvement in ImageNet accuracy over CLIP's zero-shot performance.
arXiv Detail & Related papers (2024-04-08T12:44:31Z)
- Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science [27.727207443432278]
We evaluate the zero-shot performance of two publicly accessible Large Language Models, ChatGPT and OpenAssistant.
We find that different prompting strategies can significantly affect classification accuracy, with variations in accuracy and F1 scores exceeding 10%.
arXiv Detail & Related papers (2023-05-23T17:48:21Z)
- Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification [58.720142291102135]
This case study investigates the task of job classification in a real-world setting.
The goal is to determine whether an English-language job posting is appropriate for a graduate or entry-level position.
arXiv Detail & Related papers (2023-03-13T14:09:53Z)
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet [139.56863124214905]
We find that the fine-tuning performance of CLIP is substantially underestimated.
Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy, respectively, on the ImageNet-1K dataset.
arXiv Detail & Related papers (2022-12-12T18:59:59Z)
- Towards Reliable Zero Shot Classification in Self-Supervised Models with Conformal Prediction [0.688204255655161]
We develop a conformal prediction procedure to assess when a given test caption may be reliably used.
We show that our proposed conformal procedure improves the reliability of CLIP-style models in the zero-shot classification setting.
arXiv Detail & Related papers (2022-10-27T23:52:14Z)
- Learning to Decompose Visual Features with Latent Textual Prompts [140.2117637223449]
We propose Decomposed Feature Prompting (DeFo) to improve vision-language models.
Our empirical study shows that DeFo significantly improves vision-language models.
arXiv Detail & Related papers (2022-10-09T15:40:13Z)
- Prune Responsibly [0.913755431537592]
Irrespective of the specific definition of fairness in a machine learning application, pruning the underlying model affects it.
We investigate and document the emergence and exacerbation of undesirable per-class performance imbalances across tasks and architectures, covering almost one million categories in over 100K image classification models that undergo pruning.
We demonstrate the need for transparent reporting, inclusive of bias, fairness, and inclusion metrics, in real-life engineering decision-making around neural network pruning.
arXiv Detail & Related papers (2020-09-10T04:43:11Z)
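The transductive CLIP entry above treats inference as a joint problem over a mini-batch rather than per image. As a generic illustration of that idea, the sketch below rebalances the soft class assignments of all unlabeled queries together with a Sinkhorn-style normalization; this is an assumption for illustration, not the cited paper's exact algorithm.

```python
# Hedged sketch of transductive zero-shot inference with CLIP features:
# class assignments are refined jointly over a mini-batch via a generic
# Sinkhorn-style balancing step (NOT the cited paper's exact method).
import torch

def transductive_predict(image_feats, text_feats, temperature=0.01, iters=10):
    """image_feats: (N, d) and text_feats: (C, d), both L2-normalized.
    Jointly rebalances soft assignments so no class absorbs the whole
    batch, then returns hard labels."""
    logits = image_feats @ text_feats.T / temperature          # (N, C)
    probs = logits.softmax(dim=-1)
    n, c = probs.shape
    for _ in range(iters):
        probs = probs / probs.sum(dim=0, keepdim=True) * (n / c)  # equalize class mass
        probs = probs / probs.sum(dim=1, keepdim=True)            # each image sums to 1
    return probs.argmax(dim=-1)
```

The balancing prior only helps when the batch's class distribution is roughly uniform; `temperature` and `iters` here are illustrative hyperparameters, not values from the cited paper.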