Weighted Point Cloud Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric
- URL: http://arxiv.org/abs/2404.19228v2
- Date: Thu, 10 Oct 2024 03:01:50 GMT
- Title: Weighted Point Cloud Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric
- Authors: Toshimitsu Uesaka, Taiji Suzuki, Yuhta Takida, Chieh-Hsin Lai, Naoki Murata, Yuki Mitsufuji
- Abstract summary: We show the benefit of our proposed method through a new understanding of the contrastive loss of CLIP.
We show that our proposed similarity based on weighted point clouds consistently achieves the optimal similarity.
- Score: 44.95433989446052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In typical multimodal contrastive learning, such as CLIP, encoders produce one point in the latent representation space for each input. However, a one-point representation has difficulty capturing the relationships and similarity structure of the huge number of instances in the real world. To enable richer classes of similarity, we propose the use of weighted point clouds, namely, sets of pairs of a weight and a vector, as representations of instances. In this work, we theoretically show the benefit of our proposed method through a new understanding of the contrastive loss of CLIP, which we call symmetric InfoNCE. We clarify that the optimal similarity that minimizes symmetric InfoNCE is the pointwise mutual information, and show an upper bound on the excess risk of downstream classification tasks for representations that achieve the optimal similarity. In addition, we show that our proposed similarity based on weighted point clouds consistently achieves the optimal similarity. To verify the effectiveness of our proposed method, we demonstrate pretraining of text-image representation models and classification tasks on common benchmarks.
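As a reading aid, here is a minimal PyTorch sketch of the two ingredients the abstract describes. The log-sum-exp form of the weighted point cloud similarity, the function names, and the tensor shapes are assumptions chosen to match "sets of pairs of weight and vector", not the paper's implementation; per the theory above, the similarity that minimizes symmetric InfoNCE is the pointwise mutual information log p(x, y) / (p(x) p(y)).

```python
import torch
import torch.nn.functional as F

def wpc_similarity(wa, va, wb, vb):
    """Similarity between two batches of weighted point clouds.

    Hypothetical form consistent with the abstract: an instance is a set
    {(w_i, v_i)} of weight-vector pairs, and the similarity of two
    instances is log sum_{i,j} w_i w'_j exp(<v_i, v'_j>).

    wa: (B, N) nonnegative weights summing to 1; va: (B, N, D) vectors.
    wb: (B, M) weights;                          vb: (B, M, D) vectors.
    Returns the (B, B) matrix of pairwise instance similarities.
    """
    inner = torch.einsum("and,bmd->abnm", va, vb)        # <v_i, v'_j>
    logw = (torch.log(wa + 1e-12)[:, None, :, None]
            + torch.log(wb + 1e-12)[None, :, None, :])   # log(w_i w'_j)
    return torch.logsumexp(inner + logw, dim=(-2, -1))

def symmetric_infonce(sim, temperature=0.07):
    """CLIP's symmetric InfoNCE: cross-entropy in both directions over a
    (B, B) similarity matrix whose diagonal holds the matched pairs."""
    logits = sim / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))
```

With N = M = 1 and unit weights, `wpc_similarity` reduces to a plain inner product, recovering the usual one-point CLIP setup as a special case.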
Related papers
- The Double-Ellipsoid Geometry of CLIP [4.013156524547072]
Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications.
We show that text and image embeddings reside on linearly separable ellipsoid shells not centered at the origin.
A new notion of conformity is introduced, which measures the average cosine similarity of an instance to all other instances (sketched below).
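A minimal sketch of such an average-cosine-similarity statistic, assuming unit-normalized embeddings; the paper's exact definition of conformity may differ.

```python
import torch
import torch.nn.functional as F

def conformity(embeddings: torch.Tensor) -> torch.Tensor:
    """Average cosine similarity of each instance to all other
    instances. embeddings: (N, D). Returns (N,) per-instance scores."""
    z = F.normalize(embeddings, dim=1)  # unit-norm rows
    cos = z @ z.T                       # (N, N) cosine similarities
    n = z.size(0)
    # Drop self-similarity (the diagonal of ones) before averaging.
    return (cos.sum(dim=1) - 1.0) / (n - 1)
```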
arXiv Detail & Related papers (2024-11-21T16:27:22Z)
- Efficient Fairness-Performance Pareto Front Computation [51.558848491038916]
We show that optimal fair representations possess several useful structural properties.
We then show that these approximation problems can be solved efficiently via concave programming methods.
arXiv Detail & Related papers (2024-09-26T08:46:48Z)
- CLIP Adaptation by Intra-modal Overlap Reduction [1.2277343096128712]
We analyse the intra-modal overlap in image space in terms of embedding representation.
We train a lightweight adapter on a generic set of samples from the Google Open Images dataset.
arXiv Detail & Related papers (2024-09-17T16:40:58Z)
- Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence [51.54175067684008]
This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks.
We first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes.
Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
arXiv Detail & Related papers (2024-03-17T07:02:55Z)
- Asymmetric Patch Sampling for Contrastive Learning [17.922853312470398]
Asymmetric appearance within a positive pair effectively reduces the risk of representation degradation in contrastive learning.
We propose a novel asymmetric patch sampling strategy for contrastive learning, to boost the appearance asymmetry for better representations.
arXiv Detail & Related papers (2023-06-05T13:10:48Z)
- Counting Like Human: Anthropoid Crowd Counting on Modeling the Similarity of Objects [92.80955339180119]
Mainstream crowd counting methods regress a density map and integrate it to obtain counting results.
Inspired by this, we propose a rational and anthropoid crowd counting framework.
arXiv Detail & Related papers (2022-12-02T07:00:53Z)
- Correlation between Alignment-Uniformity and Performance of Dense Contrastive Representations [11.266613717084788]
We analyze the theoretical ideas of dense contrastive learning using a standard CNN and a straightforward feature matching scheme.
We discover the core principle in constructing a positive pair of dense features and empirically prove its validity.
Also, we introduce a new scalar metric that summarizes the correlation between alignment and uniformity and downstream performance; the standard alignment and uniformity measures are sketched below.
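The paper's scalar metric is not given here, but the alignment and uniformity quantities it builds on have standard definitions (Wang and Isola, 2020); a minimal sketch under those definitions:

```python
import torch

def alignment(x, y, alpha=2):
    """Mean distance between positive pairs; lower is better.
    x, y: (N, D) L2-normalized embeddings of matched views."""
    return (x - y).norm(dim=1).pow(alpha).mean()

def uniformity(x, t=2):
    """Log of the mean Gaussian potential over all pairs; lower means
    the embeddings spread more uniformly on the hypersphere.
    x: (N, D) L2-normalized embeddings."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```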
arXiv Detail & Related papers (2022-10-17T08:08:37Z)
- Attributable Visual Similarity Learning [90.69718495533144]
This paper proposes an attributable visual similarity learning (AVSL) framework for a more accurate and explainable similarity measure between images.
Motivated by the human semantic similarity cognition, we propose a generalized similarity learning paradigm to represent the similarity between two images with a graph.
Experiments on the CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate significant improvements over existing deep similarity learning methods.
arXiv Detail & Related papers (2022-03-28T17:35:31Z)
- Graph Contrastive Clustering [131.67881457114316]
We propose a novel graph contrastive learning framework, apply it to the clustering task, and arrive at the Graph Contrastive Clustering (GCC) method.
Specifically, on the one hand, the graph Laplacian based contrastive loss is proposed to learn more discriminative and clustering-friendly features.
On the other hand, a novel graph-based contrastive learning strategy is proposed to learn more compact clustering assignments.
arXiv Detail & Related papers (2021-04-03T15:32:49Z)
- Beyond Single Instance Multi-view Unsupervised Representation Learning [21.449132256091662]
We achieve more accurate instance discrimination by measuring the joint similarity between two randomly sampled instances.
We believe that learning joint similarity helps to improve the performance when encoded features are distributed more evenly in the latent space.
arXiv Detail & Related papers (2020-11-26T15:43:27Z)
- Uncertainty-Aware Few-Shot Image Classification [118.72423376789062]
Few-shot image classification learns to recognize new categories from limited labelled data.
We propose an Uncertainty-Aware Few-Shot framework for image classification.
arXiv Detail & Related papers (2020-10-09T12:26:27Z)
- Unsupervised Learning of Visual Features by Contrasting Cluster Assignments [57.33699905852397]
We propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring pairwise comparisons to be computed.
Our method simultaneously clusters the data while enforcing consistency between cluster assignments.
Our method can be trained with large and small batches and can scale to unlimited amounts of data; a minimal sketch of the swapped-prediction idea follows.
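A minimal sketch of SwAV's swapped-prediction objective, assuming L2-normalized features and prototypes; the queue, multi-crop augmentation, and exact hyperparameters of the paper are omitted.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn(scores, eps=0.05, iters=3):
    """Turn (B, K) prototype scores into soft assignments ("codes")
    that are roughly balanced across the K prototypes."""
    Q = torch.exp(scores / eps).T  # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(iters):
        Q /= Q.sum(dim=1, keepdim=True) * K  # balance prototypes
        Q /= Q.sum(dim=0, keepdim=True) * B  # balance samples
    return (Q * B).T               # (B, K) assignment targets

def swav_loss(z1, z2, prototypes, temp=0.1):
    """Each view predicts the other view's codes, so the loss needs
    no pairwise feature comparisons across the batch."""
    p1, p2 = z1 @ prototypes.T, z2 @ prototypes.T
    q1, q2 = sinkhorn(p1), sinkhorn(p2)
    l1 = -(q2 * F.log_softmax(p1 / temp, dim=1)).sum(1).mean()
    l2 = -(q1 * F.log_softmax(p2 / temp, dim=1)).sum(1).mean()
    return 0.5 * (l1 + l2)
```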
arXiv Detail & Related papers (2020-06-17T14:00:42Z)
- An efficient manifold density estimator for all recommendation systems [3.2981402185055213]
We propose a framework utilizing arbitrary vector representations with the property of local similarity to succinctly represent smooth probability densities.
Our approximate representation has the desirable properties of being fixed-size and having simple additive compositionality, thus being especially amenable to treatment with neural networks.
Applying our estimator to both top-k and session-based recommendation settings, we establish new state-of-the-art results on multiple open datasets in both uni-modal and multi-modal settings (a toy sketch of the fixed-size additive representation follows).
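A toy illustration of a fixed-size, additive density representation built from locality-sensitive hashing; the hashing scheme and every name below are assumptions for illustration, and the paper's actual estimator is more elaborate.

```python
import numpy as np

def density_sketch(vectors, planes, width=128):
    """Toy fixed-size, additive density representation.

    vectors: (N, D) item embeddings with the local-similarity property.
    planes:  (R, H, D) random hyperplanes, e.g.
             rng.standard_normal((R, H, D)); R independent partitions
             of H hyperplanes each, one LSH code per partition.
    Returns a fixed-size (R * width,) histogram; sketches of two item
    sets simply add, giving the additive compositionality noted above.
    """
    R, H, D = planes.shape
    s = np.zeros((R, width))
    for r in range(R):
        bits = (vectors @ planes[r].T > 0).astype(np.int64)  # (N, H)
        codes = bits @ (1 << np.arange(H))   # integer bucket per item
        np.add.at(s[r], codes % width, 1.0)  # count items per bucket
    return s.ravel()
```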
arXiv Detail & Related papers (2020-06-02T19:20:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.