Is CLIP ideal? No. Can we fix it? Yes!
- URL: http://arxiv.org/abs/2503.08723v1
- Date: Mon, 10 Mar 2025 23:42:04 GMT
- Title: Is CLIP ideal? No. Can we fix it? Yes!
- Authors: Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona
- Abstract summary: Contrastive Language-Image Pre-Training is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. We propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models.
- Score: 30.71718499767702
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: https://github.com/Raphoo/DCSM_Ideal_CLIP
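As a rough illustration of the idea, here is a minimal sketch, assuming per-patch image embeddings and per-token text embeddings are available from a CLIP-like dual encoder. The map itself is simply the token-by-patch matrix of cosine similarities; the max-then-mean aggregation into a scalar score below is an illustrative assumption, not necessarily the scoring used in the paper (see the linked repository for the actual implementation).

```python
# Minimal sketch of a Dense Cosine Similarity Map (DCSM), assuming access to
# per-patch image embeddings and per-token text embeddings from a CLIP-like
# dual encoder. The aggregation into a scalar score (mean over per-token max
# similarity) is an illustrative choice, not necessarily the paper's scoring head.
import torch
import torch.nn.functional as F

def dense_cosine_similarity_map(patch_emb: torch.Tensor,
                                token_emb: torch.Tensor) -> torch.Tensor:
    """patch_emb: (num_patches, d), token_emb: (num_tokens, d).
    Returns a (num_tokens, num_patches) map of cosine similarities,
    retaining the topology of patches and tokens rather than pooling them."""
    patch_emb = F.normalize(patch_emb, dim=-1)
    token_emb = F.normalize(token_emb, dim=-1)
    return token_emb @ patch_emb.T

def dcsm_score(patch_emb: torch.Tensor, token_emb: torch.Tensor) -> torch.Tensor:
    """Collapse the map into one image-text score: for each text token take its
    best-matching patch, then average over tokens (illustrative aggregation)."""
    dcsm = dense_cosine_similarity_map(patch_emb, token_emb)
    return dcsm.max(dim=1).values.mean()

if __name__ == "__main__":
    # Stand-ins for encoder outputs: 196 ViT patches and 8 text tokens, dim 512.
    patches = torch.randn(196, 512)
    tokens = torch.randn(8, 512)
    print(dense_cosine_similarity_map(patches, tokens).shape)  # (8, 196)
    print(dcsm_score(patches, tokens))
```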
Related papers
- Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer representations. SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z)
- Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.
We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute.
We also illustrate that the resulting embeddings satisfy desirable geometric properties in embedding space to a greater degree.
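For context, here is a minimal sketch of the kind of attribute-based ranking evaluated above, using a generic zero-shot baseline that projects image embeddings onto the direction between two attribute prompts; this is an assumed illustration, not the paper's comparative fine-tuning procedure.

```python
# Generic sketch of ranking images by an attribute with a CLIP-style dual
# encoder. Projecting image embeddings onto the direction between two attribute
# prompts is a common zero-shot baseline; it is NOT the paper's fine-tuning method.
import torch
import torch.nn.functional as F

def rank_by_attribute(image_emb: torch.Tensor,
                      text_emb_high: torch.Tensor,
                      text_emb_low: torch.Tensor) -> torch.Tensor:
    """image_emb: (N, d); text_emb_high/low: (d,) embeddings of prompts such as
    'a large object' / 'a small object'. Returns indices sorted from most to
    least expressive of the attribute."""
    image_emb = F.normalize(image_emb, dim=-1)
    direction = F.normalize(text_emb_high - text_emb_low, dim=-1)
    scores = image_emb @ direction  # projection onto the attribute axis
    return torch.argsort(scores, descending=True)

if __name__ == "__main__":
    # Placeholder embeddings standing in for encoder outputs (d = 512).
    images = torch.randn(10, 512)
    high, low = torch.randn(512), torch.randn(512)
    print(rank_by_attribute(images, high, low))
```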
arXiv Detail & Related papers (2024-09-15T13:02:14Z)
- Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation [72.47110803885235]
We introduce a novel framework named Cascade-CLIP for zero-shot semantic segmentation.
Our framework achieves superior zero-shot performance on segmentation benchmarks like COCO-Stuff, Pascal-VOC, and Pascal-Context.
arXiv Detail & Related papers (2024-06-02T08:32:51Z)
- CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation [31.264574799748903]
We propose an open-vocabulary semantic segmentation method, which does not require any annotations.
We show that the self-supervised feature properties it relies on can be learnt directly from CLIP features.
Our method CLIP-DINOiser needs only a single forward pass of CLIP and two light convolutional layers at inference.
arXiv Detail & Related papers (2023-12-19T17:40:27Z)
- CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free [12.15899043709721]
We propose an open-vocabulary semantic segmentation method, dubbed CLIP-DIY.
It exploits CLIP's classification abilities on patches of different sizes and aggregates the decisions into a single map.
We obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.
arXiv Detail & Related papers (2023-09-25T16:52:59Z)
- TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation [53.974228542090046]
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks.
Existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes.
We propose TagCLIP (Trusty-aware guided CLIP) to address this issue.
arXiv Detail & Related papers (2023-04-15T12:52:23Z)
- LidarCLIP or: How I Learned to Talk to Point Clouds [3.0623865942628594]
LidarCLIP is a mapping from automotive point clouds to a pre-existing CLIP embedding space.
We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is on par with image-based retrieval.
We also explore zero-shot classification and show that LidarCLIP outperforms existing attempts to use CLIP for point clouds by a large margin.
arXiv Detail & Related papers (2022-12-13T19:02:35Z)
- ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation [35.60888272729273]
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme.
While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost.
We propose a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level.
arXiv Detail & Related papers (2022-12-07T12:05:00Z)
- CLIP2GAN: Towards Bridging Text with the Latent Space of GANs [128.47600914674985]
We propose a novel framework, CLIP2GAN, that leverages the CLIP model and StyleGAN.
The key idea of our CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN.
arXiv Detail & Related papers (2022-11-28T04:07:17Z)
- PointCLIP: Point Cloud Understanding by CLIP [77.02399444893963]
We propose PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts.
PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP at low resource cost and in a low-data regime.
arXiv Detail & Related papers (2021-12-04T19:42:40Z)
- DenseCLIP: Extract Free Dense Labels from CLIP [130.3830819077699]
Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition.
DenseCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
Our finding suggests that DenseCLIP can serve as a new reliable source of supervision for dense prediction tasks.
arXiv Detail & Related papers (2021-12-02T09:23:01Z)