Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
- URL: http://arxiv.org/abs/2406.17639v3
- Date: Mon, 16 Sep 2024 15:32:11 GMT
- Title: Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
- Authors: Sedigheh Eslami, Gerard de Melo
- Abstract summary: Contrastive Language--Image Pre-training has demonstrated remarkable improvements in zero-shot classification and cross-modal vision-language tasks.
From a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap.
We show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings.
- Score: 22.076206386214565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language--Image Pre-training (CLIP) has demonstrated remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to exhibit a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with the different modalities densely distributed in distinct subregions of the hypersphere. In this work, we aim to answer three main questions: 1. Does sharing the parameter space between the multi-modal encoders reduce the modality gap? 2. Can the gap be mitigated by pushing apart the uni-modal embeddings via intra-modality separation? 3. How do these gap-reduction approaches affect downstream performance? We design AlignCLIP to answer these questions, and through extensive experiments we show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings, thereby reducing the modality gap while improving performance across several zero-shot and fine-tuning downstream evaluations.
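As a concrete illustration, the modality gap is commonly quantified as the distance between the two modalities' centroids on the unit hypersphere. A minimal sketch (the random tensors below stand in for real CLIP encoder outputs; this is not code from the paper):

```python
import torch
import torch.nn.functional as F

def modality_gap(image_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Euclidean distance between the modality centroids on the unit hypersphere."""
    img_center = F.normalize(image_emb, dim=-1).mean(dim=0)
    txt_center = F.normalize(text_emb, dim=-1).mean(dim=0)
    return (img_center - txt_center).norm().item()

# Stand-in embeddings; in practice these come from CLIP's image and text encoders.
image_emb = torch.randn(512, 768)
text_emb = torch.randn(512, 768) + 1.0  # constant offset mimics a modality gap
print(f"modality gap: {modality_gap(image_emb, text_emb):.3f}")
```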
Related papers
- Continuous Optimization for Feature Selection with Permutation-Invariant Embedding and Policy-Guided Search [27.29822608207957]
We develop an encoder-decoder paradigm to encode feature selection knowledge into a continuous embedding space.
We also employ a policy-based reinforcement learning approach to guide the exploration of the embedding space.
arXiv Detail & Related papers (2025-05-16T18:08:16Z)
- Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks.
LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content.
We propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner.
arXiv Detail & Related papers (2025-01-03T17:56:28Z)
- Revisiting Self-Supervised Heterogeneous Graph Learning from Spectral Clustering Perspective [52.662463893268225]
Self-supervised heterogeneous graph learning (SHGL) has shown promising potential in diverse scenarios.
Existing SHGL methods encounter two significant limitations.
We introduce a novel framework enhanced by rank and dual consistency constraints.
arXiv Detail & Related papers (2024-12-01T09:33:20Z)
- ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
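A loose sketch of this residual cross-correlation idea (illustrative only; the layer choice, scaling, and blend weight `alpha` are assumptions, not ResCLIP's implementation):

```python
import torch

def residual_cross_correlation_attention(
    inter_q: list[torch.Tensor],  # queries from intermediate layers, each (N, D)
    inter_k: list[torch.Tensor],  # keys from intermediate layers, each (N, D)
    final_attn: torch.Tensor,     # final-block attention map, (N, N)
    alpha: float = 0.5,           # hypothetical blend weight
) -> torch.Tensor:
    """Average q-k cross-correlation attention from intermediate layers
    and blend it residually into the final block's attention."""
    maps = []
    for q, k in zip(inter_q, inter_k):
        scale = q.shape[-1] ** -0.5
        maps.append(torch.softmax(q @ k.T * scale, dim=-1))
    rcs = torch.stack(maps).mean(dim=0)
    return (1 - alpha) * final_attn + alpha * rcs
```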
arXiv Detail & Related papers (2024-11-24T14:14:14Z)
- Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment and Multi-Scale Token Recycling [11.129453244307369]
Fine-grained sketch-based image retrieval (FG-SBIR) aims to minimize the distance between sketches and their corresponding images in the embedding space.
We propose an effective approach to narrow the gap between the two domains.
It mainly facilitates unified mutual information sharing both within and across samples.
arXiv Detail & Related papers (2024-06-17T13:49:12Z)
- It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap [4.437949196235149]
A modality gap has been reported in two-encoder contrastive models like CLIP.
We show that even when previously proposed causes of the gap are accounted for, the contrastive loss itself creates a gap during training.
We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings that occupy only a small portion of the latent space.
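The low-uniformity diagnosis this entry refers to can be checked with the standard uniformity metric of Wang & Isola (2020); a brief sketch of how it is typically computed (not code from the paper):

```python
import torch
import torch.nn.functional as F

def uniformity(emb: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Wang & Isola (2020) uniformity: log of the mean pairwise Gaussian
    potential on the hypersphere. More negative = more uniformly spread."""
    z = F.normalize(emb, dim=-1)
    sq_dists = torch.pdist(z, p=2).pow(2)  # condensed pairwise squared distances
    return sq_dists.mul(-t).exp().mean().log()
```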
arXiv Detail & Related papers (2024-05-28T20:28:07Z)
- Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision [23.931443799102663]
We introduce a Multi-Grained Cross-modal Alignment (MGCA) framework to bridge the granularity gap without any dense annotations.
Specifically, MGCA constructs pseudo multi-granular semantic correspondences upon image-text pairs.
Our method achieves significant advancements over state-of-the-art methods, demonstrating its effectiveness and efficiency.
arXiv Detail & Related papers (2024-03-06T13:43:36Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
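As a heavily hedged illustration of what an intra-modality feature separation penalty of the kind listed under 1) could look like, here is an orthogonality-style loss between modality-shared and modality-specific components (one plausible reading; the paper's actual loss may differ):

```python
import torch
import torch.nn.functional as F

def feature_separation_loss(shared: torch.Tensor, specific: torch.Tensor) -> torch.Tensor:
    """Hypothetical intra-modality regularizer: push modality-shared and
    modality-specific feature components toward orthogonality."""
    s = F.normalize(shared, dim=-1)    # (N, D)
    p = F.normalize(specific, dim=-1)  # (N, D)
    return (s * p).sum(dim=-1).pow(2).mean()  # squared cosine similarity
```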
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- UniCLIP: Unified Framework for Contrastive Language-Image Pre-training [62.97551575508387]
We propose UniCLIP, a Unified framework for Contrastive Language-Image Pre-training.
UniCLIP integrates the contrastive loss of both inter-domain pairs and intra-domain pairs into a single universal space.
UniCLIP outperforms previous vision-language pre-training methods on various single- and multi-modality downstream tasks.
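One plausible reading of a "single universal space," sketched below: augmented views and captions are pooled into one batch, so each image anchor has both an intra-domain positive (its augmentation) and an inter-domain positive (its caption). This is an illustration, not UniCLIP's exact objective:

```python
import torch
import torch.nn.functional as F

def unified_contrastive_loss(img: torch.Tensor, img_aug: torch.Tensor,
                             txt: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Pool all embeddings into one space; each image is pulled toward its
    augmented view and its caption against every other sample in the batch."""
    n = img.shape[0]
    z = F.normalize(torch.cat([img, img_aug, txt]), dim=-1)  # (3n, d)
    sim = z @ z.T / tau
    sim.fill_diagonal_(float("-inf"))  # exclude self-similarity
    logp = sim.log_softmax(dim=-1)
    idx = torch.arange(n)
    # Positives for image anchor i: its augmentation (i+n) and caption (i+2n).
    return -(logp[idx, idx + n] + logp[idx, idx + 2 * n]).mean() / 2
```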
arXiv Detail & Related papers (2022-09-27T14:36:16Z)
- Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of semi-supervised learning (SSL) and domain adaptation (DA).
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
- Encouraging Disentangled and Convex Representation with Controllable Interpolation Regularization [15.725515910594725]
We focus on controllable disentangled representation learning (C-Dis-RL).
We propose a simple yet efficient method: Controllable Interpolation Regularization (CIR).
arXiv Detail & Related papers (2021-12-06T16:52:07Z)
- Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
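The learnable-padding idea can be sketched as a null key/value token appended to the word sequence, giving unmatched frames somewhere to attend (illustrative module, not IA-Net's exact design):

```python
import torch
import torch.nn as nn

class PaddedCrossAttention(nn.Module):
    """Frames attend over words plus one learnable null token, so frames with
    no matching word are not forced onto spurious words."""

    def __init__(self, dim: int, num_heads: int = 8):  # dim must divide by num_heads
        super().__init__()
        self.null_kv = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frames: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) queries; words: (B, L, D) keys/values
        kv = torch.cat([words, self.null_kv.expand(frames.shape[0], -1, -1)], dim=1)
        out, _ = self.attn(frames, kv, kv)
        return out
```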
arXiv Detail & Related papers (2021-09-14T02:08:23Z)
- Margin Preserving Self-paced Contrastive Learning Towards Domain Adaptation for Medical Image Segmentation [51.93711960601973]
We propose a novel margin preserving self-paced contrastive Learning model for cross-modal medical image segmentation.
With the guidance of progressively refined semantic prototypes, a novel margin preserving contrastive loss is proposed to boost the discriminability of embedded representation space.
Experiments on cross-modal cardiac segmentation tasks demonstrate that MPSCL significantly improves semantic segmentation performance.
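One plausible form of a margin-preserving contrastive loss over class prototypes, sketched under assumptions (cosine logits, margin on the positive class; not necessarily MPSCL's formulation):

```python
import torch
import torch.nn.functional as F

def margin_prototype_contrastive(feat: torch.Tensor, prototypes: torch.Tensor,
                                 labels: torch.Tensor, margin: float = 0.2,
                                 tau: float = 0.1) -> torch.Tensor:
    """Features are contrasted against class prototypes; subtracting a margin
    from the positive logit enforces extra separation between classes."""
    z = F.normalize(feat, dim=-1)        # (N, D)
    p = F.normalize(prototypes, dim=-1)  # (C, D)
    logits = z @ p.T                     # cosine similarities, (N, C)
    logits[torch.arange(len(labels)), labels] -= margin
    return F.cross_entropy(logits / tau, labels)
```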
arXiv Detail & Related papers (2021-03-15T15:23:10Z)
- Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO aligns the similarity scores by considering the discrepancy between an image and its neighbors.
IDA-SSE provides convincing inter-class neighbors by introducing virtual candidate images generated with a GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)
- COBRA: Contrastive Bi-Modal Representation Algorithm [43.33840912256077]
We present a novel framework that aims to train two modalities in a joint fashion inspired by Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE) paradigms.
We empirically show that this framework reduces the modality gap significantly and generates a robust and task agnostic joint-embedding space.
We outperform existing work on four diverse downstream tasks spanning seven benchmark cross-modal datasets.
arXiv Detail & Related papers (2020-05-07T18:20:12Z)
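For reference, the generic symmetric NCE objective that CLIP-style bi-modal training builds on (the common textbook form; COBRA's actual loss differs in its details):

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Matched (a_i, b_i) pairs are positives; all other in-batch pairs are
    negatives, scored in both retrieval directions."""
    za, zb = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = za @ zb.T / tau
    targets = torch.arange(len(a))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```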