The Double-Ellipsoid Geometry of CLIP
- URL: http://arxiv.org/abs/2411.14517v1
- Date: Thu, 21 Nov 2024 16:27:22 GMT
- Title: The Double-Ellipsoid Geometry of CLIP
- Authors: Meir Yossef Levi, Guy Gilboa
- Abstract summary: Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications.
We show that text and image reside on linearly separable ellipsoid shells, not centered at the origin.
A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance.
- Score: 4.013156524547072
- Abstract: Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications within a large variety of domains. We investigate the geometry of this embedding, which is still not well understood. We examine the raw unnormalized embedding and show that text and image reside on linearly separable ellipsoid shells, not centered at the origin. We explain the benefits of having this structure, allowing to better embed instances according to their uncertainty during contrastive training. Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance within a representative data set. We show this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP's modality gap optimizes the matching of the conformity distributions of image and text.
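The conformity measure described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `conformity` and `conformity_estimate` are hypothetical, the data is a synthetic Gaussian cloud offset from the origin rather than real CLIP embeddings, and the mean vector is assumed here to be the mean of the raw unnormalized embeddings.

```python
import numpy as np


def conformity(embeddings: np.ndarray) -> np.ndarray:
    """Average cosine similarity of each row to every other row."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T  # pairwise cosine similarities
    n = unit.shape[0]
    # Drop the self-similarity on the diagonal before averaging.
    return (sims.sum(axis=1) - np.diag(sims)) / (n - 1)


def conformity_estimate(embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity of each row to the mean vector of the set."""
    mean_vec = embeddings.mean(axis=0)  # assumption: mean of raw embeddings
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ (mean_vec / np.linalg.norm(mean_vec))


# Synthetic stand-in for one modality: a cloud not centered at the origin.
rng = np.random.default_rng(0)
offset = np.full(64, 2.0)
X = rng.normal(size=(500, 64)) + offset

exact = conformity(X)
approx = conformity_estimate(X)
corr = np.corrcoef(exact, approx)[0, 1]
print(f"correlation between exact and estimated conformity: {corr:.4f}")
```

The exact measure costs O(n^2) similarity evaluations, while the estimate needs only one pass to compute the mean and one dot product per instance, which is why a mean-vector shortcut is attractive on a large representative data set.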
Related papers
- CLIP Adaptation by Intra-modal Overlap Reduction [1.2277343096128712]
We analyse the intra-modal overlap in image space in terms of embedding representation.
We train a lightweight adapter on a generic set of samples from the Google Open Images dataset.
arXiv Detail & Related papers (2024-09-17T16:40:58Z)
- Weighted Point Cloud Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric [44.95433989446052]
We show the benefit of our proposed method through a new understanding of the contrastive loss of CLIP.
We show that our proposed similarity based on weighted point clouds consistently achieves the optimal similarity.
arXiv Detail & Related papers (2024-04-30T03:15:04Z)
- Is Cosine-Similarity of Embeddings Really About Similarity? [46.75365717794515]
Cosine-similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations.
We study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights.
We derive analytically how cosine-similarity can yield arbitrary and therefore meaningless similarities.
arXiv Detail & Related papers (2024-03-08T16:48:20Z)
- Understanding Imbalanced Semantic Segmentation Through Neural Collapse [81.89121711426951]
We show that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes.
We introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure.
Our method ranks 1st and sets a new record on the ScanNet200 test leaderboard.
arXiv Detail & Related papers (2023-01-03T13:51:51Z)
- Attributable Visual Similarity Learning [90.69718495533144]
This paper proposes an attributable visual similarity learning (AVSL) framework for a more accurate and explainable similarity measure between images.
Motivated by the human semantic similarity cognition, we propose a generalized similarity learning paradigm to represent the similarity between two images with a graph.
Experiments on the CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate significant improvements over existing deep similarity learning methods.
arXiv Detail & Related papers (2022-03-28T17:35:31Z)
- Autoencoder Image Interpolation by Shaping the Latent Space [12.482988592988868]
Autoencoders represent an effective approach for computing the underlying factors characterizing datasets of different types.
We propose a regularization technique that shapes the latent representation to follow a manifold consistent with the training images.
arXiv Detail & Related papers (2020-08-04T12:32:54Z)
- Making Affine Correspondences Work in Camera Geometry Computation [62.7633180470428]
Local features provide region-to-region rather than point-to-point correspondences.
We propose guidelines for effective use of region-to-region matches in the course of a full model estimation pipeline.
Experiments show that affine solvers can achieve accuracy comparable to point-based solvers at faster run-times.
arXiv Detail & Related papers (2020-07-20T12:07:48Z)
- Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix.
On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes.
arXiv Detail & Related papers (2020-03-18T13:07:51Z)
- Learning Flat Latent Manifolds with VAEs [16.725880610265378]
We propose an extension to the framework of variational auto-encoders, where the Euclidean metric is a proxy for the similarity between data points.
We replace the compact prior typically used in variational auto-encoders with a recently presented, more expressive hierarchical one.
We evaluate our method on a range of data-sets, including a video-tracking benchmark.
arXiv Detail & Related papers (2020-02-12T09:54:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.