Delving into the Openness of CLIP
- URL: http://arxiv.org/abs/2206.01986v3
- Date: Sun, 7 May 2023 15:04:28 GMT
- Title: Delving into the Openness of CLIP
- Authors: Shuhuai Ren, Lei Li, Xuancheng Ren, Guangxiang Zhao, Xu Sun
- Abstract summary: We evaluate the openness of Contrastive Language-Image Pre-training models.
Our evaluation shows that CLIP-like models are not truly open, and their performance deteriorates as the vocabulary expands.
Our investigation reveals that the overestimation of openness is due to confusion among competing text features.
- Score: 35.371811948506796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) formulates image
classification as an image-to-text matching task, i.e., matching images to the
corresponding natural language descriptions instead of discrete category IDs.
This allows for open-vocabulary visual recognition, where the model can
recognize images from an open class set (also known as an open vocabulary) in a
zero-shot manner. However, evaluating the openness of CLIP-like models is
challenging, as the models are open to arbitrary vocabulary in theory, but
their accuracy varies in practice. To address this, we take an incremental
perspective to assess openness through vocabulary expansion, and define
extensibility to measure a model's ability to handle novel classes. Our
evaluation shows that CLIP-like models are not truly open, and their
performance deteriorates as the vocabulary expands. We further dissect the
feature space of CLIP from the perspectives of representation alignment and
uniformity. Our investigation reveals that the overestimation of openness is
due to confusion among competing text features, rather than a failure to
capture the similarity between image features and text features of novel
classes. We hope that our investigation and analysis will facilitate future
research on the CLIP openness issue.
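To make the evaluation protocol concrete, the sketch below illustrates the kind of vocabulary-expansion probe the abstract describes: zero-shot classification is run with a base label set and again after novel class names are added, so that a drop in accuracy on the original classes can be attributed to confusion among competing text features. It also includes the standard alignment and uniformity metrics from the representation-learning literature, which the paper's feature-space analysis builds on. This is a minimal sketch assuming the Hugging Face transformers CLIP checkpoint; the image path and class lists are hypothetical placeholders, and the paper's exact metric formulations and protocol may differ.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_classify(image, class_names):
    """Match one image against a set of class prompts; return the top class and probabilities."""
    prompts = [f"a photo of a {name}" for name in class_names]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return class_names[int(probs.argmax())], probs

# Hypothetical image and vocabularies: the same image is classified against a
# base label set and an expanded one to probe extensibility.
image = Image.open("example_cat.jpg")
base_vocab = ["cat", "dog", "car", "airplane"]
expanded_vocab = base_vocab + ["lynx", "tiger", "fox", "wolf", "leopard"]

print(zero_shot_classify(image, base_vocab)[0])
print(zero_shot_classify(image, expanded_vocab)[0])  # may flip to a confusable novel class

# Standard alignment/uniformity metrics (Wang & Isola, 2020) on L2-normalized
# features; the paper adapts this style of analysis to CLIP's image and text spaces.
def alignment(img_feats, txt_feats, alpha=2):
    img, txt = F.normalize(img_feats, dim=-1), F.normalize(txt_feats, dim=-1)
    return (img - txt).norm(dim=-1).pow(alpha).mean()

def uniformity(feats, t=2):
    x = F.normalize(feats, dim=-1)
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```

In the paper this probe is run over many images and successive class-group expansions; the single image here only shows the mechanics.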
Related papers
- Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge [20.09852220432504]
Contrastive Language-Image Pretraining (CLIP) performs zero-shot image classification by mapping images and textual class representation into a shared embedding space.
This work provides a new approach for interpreting CLIP models for image classification from the lens of mutual knowledge between the two modalities.
arXiv Detail & Related papers (2024-10-16T20:18:21Z)
- Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.
We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute.
We also show that the resulting embeddings exhibit a greater degree of geometric structure in embedding space.
arXiv Detail & Related papers (2024-09-15T13:02:14Z)
- Mining Open Semantics from CLIP: A Relation Transition Perspective for Few-Shot Learning [46.25534556546322]
We propose to mine open semantics as anchors and to make predictions by transitioning from image-anchor relations to image-target relations.
Our method performs favorably against previous state-of-the-art approaches in few-shot classification settings.
arXiv Detail & Related papers (2024-06-17T06:28:58Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Text-to-Image Diffusion Models are Zero-Shot Classifiers [8.26990105697146]
We investigate text-to-image diffusion models by proposing a method for evaluating them as zero-shot classifiers.
We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge.
Both models perform competitively with CLIP on a wide range of zero-shot image classification datasets.
arXiv Detail & Related papers (2023-03-27T14:15:17Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality or accuracy of the information it provides and is not responsible for any consequences of its use.