Related papers: Understanding Multimodal Contrastive Learning Through Pointwise Mutual Information

URL: http://arxiv.org/abs/2404.19228v1
Date: Tue, 30 Apr 2024 03:15:04 GMT
Title: Understanding Multimodal Contrastive Learning Through Pointwise Mutual Information
Authors: Toshimitsu Uesaka, Taiji Suzuki, Yuhta Takida, Chieh-Hsin Lai, Naoki Murata, Yuki Mitsufuji
Abstract summary: We show that encoders that achieve the optimal similarity in the pretraining provide a good representation for downstream classification tasks under mild assumptions. We also propose a new similarity metric for multimodal contrastive learning by utilizing a nonlinear kernel to enrich the capability.
Score: 44.95433989446052
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal representation learning to integrate different modalities, such as text, vision, and audio, is important for real-world applications. The symmetric InfoNCE loss proposed in CLIP is a key concept in multimodal representation learning. In this work, we provide a theoretical understanding of the symmetric InfoNCE loss through the lens of the pointwise mutual information and show that encoders that achieve the optimal similarity in the pretraining provide a good representation for downstream classification tasks under mild assumptions. Based on our theoretical results, we also propose a new similarity metric for multimodal contrastive learning by utilizing a nonlinear kernel to enrich the capability. To verify the effectiveness of the proposed method, we demonstrate pretraining of multimodal representation models on the Conceptual Caption datasets and evaluate zero-shot classification and linear classification on common benchmark datasets.
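Illustration: the abstract refers to the symmetric InfoNCE loss used in CLIP. The following is a minimal NumPy sketch of that loss, not the authors' implementation. It assumes cosine similarity between L2-normalized embeddings and a fixed temperature of 0.07 (a common CLIP-style default, chosen here only for illustration); the function names `log_softmax` and `symmetric_info_nce` are hypothetical. The nonlinear-kernel similarity proposed in the paper is not shown, since this page does not describe it.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each
    array is assumed to come from the same image-text pair.
    temperature: assumed scaling constant (illustrative default).
    """
    # Cosine similarity: normalize rows, then take all pairwise dot products.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (batch, batch)

    idx = np.arange(logits.shape[0])
    # Image-to-text: each row is a classification over the texts in the batch.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column is a classification over the images.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    texts = images + 0.1 * rng.normal(size=(8, 16))  # roughly aligned pairs
    print("aligned pairs:   ", symmetric_info_nce(images, texts))
    print("mismatched pairs:", symmetric_info_nce(images, np.roll(texts, 1, axis=0)))
```

Averaging the two directions makes the loss invariant to swapping the modalities, and correctly paired batches should score lower than batches whose pairs have been shifted out of alignment.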

Related papers

The Double-Ellipsoid Geometry of CLIP [4.013156524547072]
Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications. We show that text and image reside on linearly separable ellipsoid shells, not centered at the origin. A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance.
arXiv Detail & Related papers (2024-11-21T16:27:22Z)
Efficient Fairness-Performance Pareto Front Computation [51.558848491038916]
We show that optimal fair representations possess several useful structural properties. We then show that these approximating problems can be solved efficiently via concave programming methods.
arXiv Detail & Related papers (2024-09-26T08:46:48Z)
CLIP Adaptation by Intra-modal Overlap Reduction [1.2277343096128712]
We analyse the intra-modal overlap in image space in terms of embedding representation. We train a lightweight adapter on a generic set of samples from the Google Open Images dataset.
arXiv Detail & Related papers (2024-09-17T16:40:58Z)
Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence [51.54175067684008]
This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks. We first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes. Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
arXiv Detail & Related papers (2024-03-17T07:02:55Z)
Asymmetric Patch Sampling for Contrastive Learning [17.922853312470398]
Asymmetric appearance between the two views of a positive pair effectively reduces the risk of representation degradation in contrastive learning. We propose a novel asymmetric patch sampling strategy for contrastive learning to boost the appearance asymmetry for better representations.
arXiv Detail & Related papers (2023-06-05T13:10:48Z)
Counting Like Human: Anthropoid Crowd Counting on Modeling the Similarity of Objects [92.80955339180119]
Mainstream crowd counting methods regress a density map and integrate it to obtain counting results. Inspired by this, we propose a rational and anthropoid crowd counting framework.
arXiv Detail & Related papers (2022-12-02T07:00:53Z)
Correlation between Alignment-Uniformity and Performance of Dense Contrastive Representations [11.266613717084788]
We analyze the theoretical ideas of dense contrastive learning using a standard CNN and a straightforward feature matching scheme. We discover the core principle in constructing a positive pair of dense features and empirically prove its validity. We also introduce a new scalar metric that summarizes the correlation between alignment-and-uniformity and downstream performance.
arXiv Detail & Related papers (2022-10-17T08:08:37Z)
Attributable Visual Similarity Learning [90.69718495533144]
This paper proposes an attributable visual similarity learning (AVSL) framework for a more accurate and explainable similarity measure between images. Motivated by the human semantic similarity cognition, we propose a generalized similarity learning paradigm to represent the similarity between two images with a graph. Experiments on the CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate significant improvements over existing deep similarity learning methods.
arXiv Detail & Related papers (2022-03-28T17:35:31Z)
Graph Contrastive Clustering [131.67881457114316]
We propose a novel graph contrastive learning framework and apply it to the clustering task, yielding the Graph Contrastive Clustering (GCC) method. Specifically, on the one hand, a graph Laplacian based contrastive loss is proposed to learn more discriminative and clustering-friendly features. On the other hand, a novel graph-based contrastive learning strategy is proposed to learn more compact clustering assignments.
arXiv Detail & Related papers (2021-04-03T15:32:49Z)
Beyond Single Instance Multi-view Unsupervised Representation Learning [21.449132256091662]
We impose more accurate instance discrimination capability by measuring the joint similarity between two randomly sampled instances. We believe that learning joint similarity helps to improve the performance when encoded features are distributed more evenly in the latent space.
arXiv Detail & Related papers (2020-11-26T15:43:27Z)
Uncertainty-Aware Few-Shot Image Classification [118.72423376789062]
Few-shot image classification learns to recognize new categories from limited labelled data. We propose an Uncertainty-Aware Few-Shot framework for image classification.
arXiv Detail & Related papers (2020-10-09T12:26:27Z)
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments [57.33699905852397]
We propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring the computation of pairwise comparisons. Our method simultaneously clusters the data while enforcing consistency between cluster assignments. Our method can be trained with large and small batches and can scale to unlimited amounts of data.
arXiv Detail & Related papers (2020-06-17T14:00:42Z)
An efficient manifold density estimator for all recommendation systems [3.2981402185055213]
We propose a framework utilizing arbitrary vector representations with the property of local similarity to succinctly represent smooth probability densities. Our approximate representation has the desirable properties of being fixed-size and having simple additive compositionality, thus being especially amenable to treatment with neural networks. Applying the proposed estimator to both top-k and session-based recommendation settings, we establish new state-of-the-art results on multiple open datasets in both uni-modal and multi-modal settings.
arXiv Detail & Related papers (2020-06-02T19:20:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.