It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap
- URL: http://arxiv.org/abs/2405.18570v3
- Date: Thu, 6 Jun 2024 20:29:21 GMT
- Title: It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap
- Authors: Abrar Fahim, Alex Murphy, Alona Fyshe
- Abstract summary: A modality gap has been reported in two-encoder contrastive models like CLIP.
We show that even when accounting for all these factors, the contrastive loss actually creates a gap during training.
We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings that occupy only a small portion of the latent space.
- Score: 4.437949196235149
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts in a joint representational space. Recently, a modality gap has been reported in two-encoder contrastive models like CLIP, meaning that the image and text embeddings reside in disjoint areas of the latent space. Previous studies suggest that this gap exists due to 1) the cone effect, 2) mismatched pairs in the dataset, and 3) insufficient training. We show that, even when accounting for all these factors, and even when using the same modality, the contrastive loss actually creates a gap during training. As a result, we propose that the modality gap is inherent to the two-encoder contrastive loss and rename it the contrastive gap. We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings that occupy only a small portion of the latent space. To close the gap, we adapt the uniformity and alignment properties of unimodal contrastive loss to the multi-modal setting and show that simply adding these terms to the CLIP loss distributes the embeddings more uniformly in the representational space, closing the gap. In our experiments, we show that the modified representational space achieves better performance than the default CLIP loss in downstream tasks such as zero-shot image classification and multi-modal arithmetic.
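The modification described in the abstract can be sketched as follows: a standard symmetric CLIP loss augmented with the alignment and uniformity terms of Wang and Isola's unimodal analysis, adapted to two modalities. This is an illustrative reconstruction, not the authors' implementation; the term weights, temperature, and the choice to average uniformity over both modalities are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_loss(img, txt, temperature=0.07):
    # img, txt: (N, D) L2-normalized embeddings of matched image-text pairs.
    # Symmetric InfoNCE over the NxN similarity matrix.
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def alignment(img, txt):
    # Matched pairs should be close: mean squared distance between them.
    return (img - txt).pow(2).sum(dim=1).mean()

def uniformity(x, t=2.0):
    # Embeddings should spread over the hypersphere:
    # log of the mean Gaussian potential over all pairs within a modality.
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

def modified_clip_loss(img, txt, w_align=1.0, w_unif=1.0):
    # CLIP loss plus alignment and (per-modality averaged) uniformity terms.
    unif = 0.5 * (uniformity(img) + uniformity(txt))
    return clip_loss(img, txt) + w_align * alignment(img, txt) + w_unif * unif
```

Minimizing the uniformity term pushes each modality's embeddings apart on the hypersphere, which is the mechanism the paper credits with closing the contrastive gap.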
Related papers
- Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP [22.076206386214565]
Contrastive Language-Image Pre-training has yielded remarkable improvements in zero-shot classification and cross-modal vision-language tasks.
From a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap.
We show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings.
arXiv Detail & Related papers (2024-06-25T15:24:02Z) - On mitigating stability-plasticity dilemma in CLIP-guided image morphing via geodesic distillation loss [38.31276786740577]
Large-scale language-vision pre-training models, such as CLIP, have achieved remarkable text-guided image morphing results.
Existing CLIP-guided image morphing methods encounter difficulties when morphing photorealistic images.
Our method achieves superior morphing results on both images and videos for various benchmarks, including CLIP-inversion.
arXiv Detail & Related papers (2024-01-19T07:06:58Z) - Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples.
We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric.
Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-10-04T07:54:40Z) - UniCLIP: Unified Framework for Contrastive Language-Image Pre-training [62.97551575508387]
We propose UniCLIP, a Unified framework for Contrastive Language-Image Pre-training.
UniCLIP integrates the contrastive loss of both inter-domain pairs and intra-domain pairs into a single universal space.
UniCLIP outperforms previous vision-language pre-training methods on various single- and multi-modality downstream tasks.
arXiv Detail & Related papers (2022-09-27T14:36:16Z) - No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z) - DGSS : Domain Generalized Semantic Segmentation using Iterative Style Mining and Latent Representation Alignment [38.05196030226661]
Current state-of-the-art (SoTA) methods propose different mechanisms to bridge the domain gap, but they still perform poorly in low-illumination conditions.
We propose a two-step framework wherein we first identify an adversarial style that maximizes the domain gap between stylized and source images.
We then propose a style mixing mechanism wherein the same objects from different styles are mixed to construct a new training image.
arXiv Detail & Related papers (2022-02-26T13:54:57Z) - Contrastive Feature Loss for Image Prediction [55.373404869092866]
Training supervised image synthesis models requires a critic to compare two images: the ground truth and the result.
We introduce an information-theoretic approach to measuring the similarity between two images.
We show that our formulation boosts the perceptual realism of output images when used as a drop-in replacement for the L1 loss.
arXiv Detail & Related papers (2021-11-12T20:39:52Z) - Pixel Contrastive-Consistent Semi-Supervised Semantic Segmentation [22.920856071095915]
We present a novel semi-supervised semantic segmentation method which jointly achieves two desiderata of segmentation model regularities.
We leverage the pixel-level L2 loss and the pixel contrastive loss for the two purposes respectively.
arXiv Detail & Related papers (2021-08-20T07:04:33Z) - The Spatially-Correlative Loss for Various Image Translation Tasks [69.62228639870114]
We propose a novel spatially-correlative loss that is simple, efficient and yet effective for preserving scene structure consistency.
Previous methods attempt this by using pixel-level cycle-consistency or feature-level matching losses.
We show distinct improvement over baseline models in all three modes of unpaired I2I translation: single-modal, multi-modal, and even single-image translation.
arXiv Detail & Related papers (2021-04-02T02:13:30Z) - Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields.
To better train this efficient generator, in addition to the frequently used VGG feature-matching loss, we design a novel self-guided regression loss.
We also employ a discriminator with local and global branches to ensure local-global contents consistency.
arXiv Detail & Related papers (2020-02-07T03:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.