It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap
- URL: http://arxiv.org/abs/2405.18570v3
- Date: Thu, 6 Jun 2024 20:29:21 GMT
- Title: It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap
- Authors: Abrar Fahim, Alex Murphy, Alona Fyshe
- Abstract summary: A modality gap has been reported in two-encoder contrastive models like CLIP.
We show that even when accounting for all these factors, the contrastive loss actually creates a gap during training.
We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings that occupy only a small portion of the latent space.
- Score: 4.437949196235149
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts in a joint representational space. Recently, a modality gap has been reported in two-encoder contrastive models like CLIP, meaning that the image and text embeddings reside in disjoint areas of the latent space. Previous studies suggest that this gap exists due to 1) the cone effect, 2) mismatched pairs in the dataset, and 3) insufficient training. We show that, even when accounting for all these factors, and even when using the same modality, the contrastive loss actually creates a gap during training. As a result, we propose that the modality gap is inherent to the two-encoder contrastive loss and rename it the contrastive gap. We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings that occupy only a small portion of the latent space. To close the gap, we adapt the uniformity and alignment properties of the unimodal contrastive loss to the multi-modal setting and show that simply adding these terms to the CLIP loss distributes the embeddings more uniformly in the representational space, closing the gap. In our experiments, we show that the modified representational space achieves better performance than the default CLIP loss in downstream tasks such as zero-shot image classification and multi-modal arithmetic.
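To make the abstract's proposal concrete, the sketch below adds alignment and uniformity terms (in the spirit of Wang and Isola's unimodal analysis) to a standard two-encoder CLIP objective. It is a minimal PyTorch illustration, not the authors' exact formulation: the function names, the choice to compute uniformity per modality and on the pooled batch, and the weights lambda_align and lambda_uniform are assumptions made here for clarity.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Standard two-encoder (InfoNCE-style) CLIP loss on L2-normalized embeddings."""
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def alignment_loss(img_emb, txt_emb, alpha=2):
    """Alignment term: matched image-text pairs should map to nearby points."""
    return (img_emb - txt_emb).norm(p=2, dim=1).pow(alpha).mean()

def uniformity_loss(emb, t=2):
    """Uniformity term: log of the mean Gaussian potential over pairwise distances
    (Wang & Isola, 2020); lower values mean embeddings spread over the hypersphere."""
    return torch.pdist(emb, p=2).pow(2).mul(-t).exp().mean().log()

def clip_uniformity_alignment_loss(img_emb, txt_emb,
                                   lambda_align=1.0, lambda_uniform=1.0):
    """Hypothetical combined objective: CLIP loss plus alignment and uniformity
    terms; the weights and the exact set of uniformity terms are assumptions."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    uniform = (uniformity_loss(img_emb)
               + uniformity_loss(txt_emb)
               + uniformity_loss(torch.cat([img_emb, txt_emb], dim=0))) / 3
    return (clip_loss(img_emb, txt_emb)
            + lambda_align * alignment_loss(img_emb, txt_emb)
            + lambda_uniform * uniform)
```

In this sketch, minimizing the uniformity term pushes embeddings apart over the hypersphere while the alignment term pulls matched image-text pairs together, which is the mechanism the abstract credits with closing the contrastive gap left by the default loss.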
Related papers
- AIR: Zero-shot Generative Model Adaptation with Iterative Refinement [27.322307161825844]
Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained generator to a target domain using only text guidance.
Central to recent ZSGM approaches is the directional loss, which uses the text guidance by aligning the image offset with the text offset in the embedding space of a vision-language model such as CLIP.
arXiv Detail & Related papers (2025-06-12T17:00:50Z) - Post-pre-training for Modality Alignment in Vision-Language Foundation Models [12.110530026601968]
This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning.
It aims to align the feature space with one epoch of training on small image-text datasets, without degrading zero-shot performance.
arXiv Detail & Related papers (2025-04-17T07:46:19Z) - SimO Loss: Anchor-Free Contrastive Loss for Fine-Grained Supervised Contrastive Learning [0.0]
We introduce a novel anchor-free contrastive learning method leveraging our proposed Similarity-Orthogonality (SimO) loss.
Our approach minimizes a semi-metric discriminative loss function that simultaneously optimizes two key objectives.
We provide visualizations that demonstrate the impact of SimO loss on the embedding space.
arXiv Detail & Related papers (2024-10-07T17:41:10Z) - CLIP Adaptation by Intra-modal Overlap Reduction [1.2277343096128712]
We analyse the intra-modal overlap in image space in terms of embedding representation.
We train a lightweight adapter on a generic set of samples from the Google Open Images dataset.
arXiv Detail & Related papers (2024-09-17T16:40:58Z) - Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.
We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute.
We also illustrate that the resulting embeddings exhibit a greater degree of geometric structure in the embedding space.
arXiv Detail & Related papers (2024-09-15T13:02:14Z) - Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP [22.076206386214565]
Contrastive Language-Image Pre-training has delivered remarkable improvements in zero-shot classification and cross-modal vision-language tasks.
From a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap.
We show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings.
arXiv Detail & Related papers (2024-06-25T15:24:02Z) - Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment [4.682326604942316]
We focus on Contrastive Language-Image Pre-training (CLIP), a Vision-Language foundation model that achieves high accuracy across various image classification tasks.
There are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery.
We propose a methodology to align distinct RS image modalities with the visual and textual modalities of CLIP.
arXiv Detail & Related papers (2024-02-15T09:31:07Z) - On mitigating stability-plasticity dilemma in CLIP-guided image morphing via geodesic distillation loss [38.31276786740577]
Large-scale language-vision pre-training models, such as CLIP, have achieved remarkable text-guided image morphing results.
Existing CLIP-guided image morphing methods encounter difficulties when morphing photorealistic images.
Our method achieves superior morphing results on both images and videos for various benchmarks, including CLIP-inversion.
arXiv Detail & Related papers (2024-01-19T07:06:58Z) - UniCLIP: Unified Framework for Contrastive Language-Image Pre-training [62.97551575508387]
We propose UniCLIP, a Unified framework for Contrastive Language-Image Pre-training.
UniCLIP integrates the contrastive loss of both inter-domain pairs and intra-domain pairs into a single universal space.
UniCLIP outperforms previous vision-language pre-training methods on various single- and multi-modality downstream tasks.
arXiv Detail & Related papers (2022-09-27T14:36:16Z) - No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z) - Contrastive Feature Loss for Image Prediction [55.373404869092866]
Training supervised image synthesis models requires a critic to compare two images: the ground truth and the generated result.
We introduce an information theory based approach to measuring similarity between two images.
We show that our formulation boosts the perceptual realism of output images when used as a drop-in replacement for the L1 loss.
arXiv Detail & Related papers (2021-11-12T20:39:52Z) - The Spatially-Correlative Loss for Various Image Translation Tasks [69.62228639870114]
We propose a novel spatially-correlative loss that is simple, efficient and yet effective for preserving scene structure consistency.
Previous methods attempt this by using pixel-level cycle-consistency or feature-level matching losses.
We show distinct improvement over baseline models in all three modes of unpaired I2I translation: single-modal, multi-modal, and even single-image translation.
arXiv Detail & Related papers (2021-04-02T02:13:30Z) - Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields.
To better train this efficient generator, in addition to the frequently used VGG feature-matching loss, we design a novel self-guided regression loss.
We also employ a discriminator with local and global branches to ensure local-global contents consistency.
arXiv Detail & Related papers (2020-02-07T03:45:25Z)