Closing the Modality Gap for Mixed Modality Search
- URL: http://arxiv.org/abs/2507.19054v1
- Date: Fri, 25 Jul 2025 08:15:28 GMT
- Title: Closing the Modality Gap for Mixed Modality Search
- Authors: Binxu Li, Yuhui Zhang, Xiaohan Wang, Weixin Liang, Ludwig Schmidt, Serena Yeung-Levy
- Abstract summary: We investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space. We propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP's embedding space.
- Score: 47.00880557856163
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixed modality search -- retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents -- is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP's embedding space. Evaluated on MixBench -- the first benchmark specifically designed for mixed modality search -- GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP and surpasses recent vision-language generative embedding models by 4 percentage points while using 75x less compute.
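For intuition, here is a minimal sketch of what such a post-hoc calibration could look like, assuming (as the name GR-CLIP suggests) that gap removal amounts to centering each modality's embeddings and re-normalizing them; the function names, the mean-subtraction step, and the toy data are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def remove_modality_gap(image_embs, text_embs):
    """Center each modality at the origin, then re-project onto the unit sphere."""
    image_centered = image_embs - image_embs.mean(axis=0, keepdims=True)
    text_centered = text_embs - text_embs.mean(axis=0, keepdims=True)
    return l2_normalize(image_centered), l2_normalize(text_centered)

# Toy demo: random unit vectors with an artificial offset stand in for CLIP embeddings.
rng = np.random.default_rng(0)
img = l2_normalize(rng.normal(size=(100, 512)) + 2.0)
txt = l2_normalize(rng.normal(size=(100, 512)) - 2.0)
img_c, txt_c = remove_modality_gap(img, txt)
print("centroid distance before:", np.linalg.norm(img.mean(0) - txt.mean(0)))
print("centroid distance after: ", np.linalg.norm(img_c.mean(0) - txt_c.mean(0)))
```

Once both modalities share a common center, cosine similarities become comparable across modalities, which is what mixed modality ranking and fusion require.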
Related papers
- An Enhanced Model-based Approach for Short Text Clustering [58.60681789677676]
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. We propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
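For reference, a compact sketch of a collapsed Gibbs sampler for the Dirichlet Multinomial Mixture model in the spirit of GSDMM; the hyperparameters, the simplified word-count term (which ignores repeated words within a document), and the toy corpus are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from collections import defaultdict

def gsdmm(docs, K=8, alpha=0.1, beta=0.1, iters=15, seed=0):
    """Collapsed Gibbs sampler for a Dirichlet Multinomial Mixture over short texts."""
    rng = np.random.default_rng(seed)
    V = len({w for d in docs for w in d})        # vocabulary size
    D = len(docs)

    z = rng.integers(0, K, size=D)               # cluster assignment per document
    m = np.zeros(K)                              # number of documents per cluster
    n = np.zeros(K)                              # total word count per cluster
    n_zw = [defaultdict(int) for _ in range(K)]  # per-cluster word counts

    def move(d, k, sign):
        m[k] += sign
        n[k] += sign * len(docs[d])
        for w in docs[d]:
            n_zw[k][w] += sign

    for d in range(D):
        move(d, z[d], +1)

    for _ in range(iters):
        for d in range(D):
            move(d, z[d], -1)                    # remove the doc from its cluster
            log_p = np.zeros(K)
            for k in range(K):
                log_p[k] = np.log(m[k] + alpha)
                for i, w in enumerate(docs[d]):
                    # simplified: ignores repeated occurrences of w within the doc
                    log_p[k] += np.log(n_zw[k][w] + beta)
                    log_p[k] -= np.log(n[k] + V * beta + i)
            p = np.exp(log_p - log_p.max())
            z[d] = rng.choice(K, p=p / p.sum())  # sample a new cluster
            move(d, z[d], +1)
    return z

# Toy usage on a handful of short "posts".
docs = [["cat", "cute", "pet"], ["dog", "pet", "walk"],
        ["stock", "market", "crash"], ["market", "trade", "stock"]]
print(gsdmm(docs, K=2))
```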
arXiv Detail & Related papers (2025-07-18T10:07:42Z) - MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification [11.270267165348626]
We introduce a novel dataset PrideMM comprising 5,063 text-embedded images associated with the LGBTQ+ Pride movement.
We propose a novel framework MemeCLIP for efficient downstream learning while preserving the knowledge of the pre-trained CLIP model.
arXiv Detail & Related papers (2024-09-23T04:49:08Z) - Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment [0.7499722271664144]
Contrastive Language-Image Pre-training (CLIP) is a transformative method in multimedia retrieval.
CLIP typically trains two neural networks concurrently to generate joint embeddings for text and image pairs.
This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios.
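As background on the joint-embedding training mentioned above, here is a generic sketch of the symmetric CLIP-style contrastive (InfoNCE) objective over a batch of paired image and text embeddings; this is the standard formulation rather than this particular paper's code, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss: matched image-text pairs sit on the diagonal."""
    img = F.normalize(image_embs, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    logits = img @ txt.t() / temperature      # (B, B) cosine similarities, scaled
    targets = torch.arange(img.size(0))       # the i-th image matches the i-th text
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy batch of 8 paired embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```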
arXiv Detail & Related papers (2024-09-03T14:33:01Z) - Towards Optimal Aggregation of Varying Range Dependencies in Haze Removal [17.29370328189668]
Haze removal aims to restore a clear image from a hazy input. Existing methods have shown significant efficacy by capturing either short-range dependencies for local detail preservation or long-range dependencies for global context modeling. We propose DehazeMatic, which captures both short- and long-range dependencies through a dual-path design for improved restoration.
arXiv Detail & Related papers (2024-08-22T11:51:50Z) - ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality.
We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-07-17T09:52:20Z) - Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP [22.076206386214565]
Contrastive Language-Image Pre-training has yielded remarkable improvements in zero-shot classification and cross-modal vision-language tasks.
From a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap.
We show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings.
arXiv Detail & Related papers (2024-06-25T15:24:02Z) - RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z) - CPCL: Cross-Modal Prototypical Contrastive Learning for Weakly Supervised Text-based Person Re-Identification [10.64115914599574]
Weakly supervised text-based person re-identification (TPRe-ID) seeks to retrieve images of a target person using textual descriptions.
The primary challenge lies in intra-class differences, encompassing intra-modal feature variations and cross-modal semantic gaps.
In practice, CPCL introduces the CLIP model to weakly supervised TPRe-ID for the first time, mapping visual and textual instances into a shared latent space.
arXiv Detail & Related papers (2024-01-18T14:27:01Z) - Efficient Bilateral Cross-Modality Cluster Matching for Unsupervised Visible-Infrared Person ReID [56.573905143954015]
We propose a novel bilateral cluster matching-based learning framework to reduce the modality gap by matching cross-modality clusters.
Under such a supervisory signal, a Modality-Specific and Modality-Agnostic (MSMA) contrastive learning framework is proposed to align features jointly at the cluster level.
Experiments on the public SYSU-MM01 and RegDB datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-05-22T03:27:46Z) - CLIP-GCD: Simple Language Guided Generalized Category Discovery [21.778676607030253]
Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data.
Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods.
We propose to leverage multi-modal (vision and language) models, in two complementary ways.
arXiv Detail & Related papers (2023-05-17T17:55:33Z) - Clustering-Induced Generative Incomplete Image-Text Clustering (CIGIT-C) [3.2062075983668343]
We propose a Clustering-Induced Generative Incomplete Image-Text Clustering (CIGIT-C) network to address the challenges above.
We first use modality-specific encoders to map original features to more distinctive subspaces.
The latent connections between intra and inter-modalities are thoroughly explored.
arXiv Detail & Related papers (2022-09-28T01:19:52Z) - X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval [87.3821932795969]
Cross-grained contrast is the contrast between coarse-grained representations and fine-grained representations.
X-CLIP is a novel multi-grained contrastive model for video-text retrieval.
X-CLIP achieves outstanding performance on five widely-used video-text retrieval datasets.
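To make the notion of cross-grained contrast concrete, below is a small sketch of the coarse and fine similarity terms for a single video-text pair; the mean-pooled coarse representations and the plain averaging are illustrative stand-ins, not X-CLIP's actual aggregation scheme.

```python
import torch
import torch.nn.functional as F

def cross_grained_similarities(frame_embs, word_embs):
    """Coarse, fine, and cross-grained similarities for one video-text pair."""
    frames = F.normalize(frame_embs, dim=-1)           # fine-grained video (T, d)
    words = F.normalize(word_embs, dim=-1)             # fine-grained text (W, d)
    video = F.normalize(frame_embs.mean(0), dim=-1)    # coarse video representation
    sentence = F.normalize(word_embs.mean(0), dim=-1)  # coarse text representation
    return {
        "video-sentence": torch.dot(video, sentence),  # coarse-coarse
        "video-word": (words @ video).mean(),          # cross-grained
        "sentence-frame": (frames @ sentence).mean(),  # cross-grained
        "frame-word": (frames @ words.t()).mean(),     # fine-fine
    }

sims = cross_grained_similarities(torch.randn(12, 512), torch.randn(7, 512))
print({k: float(v) for k, v in sims.items()})
```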
arXiv Detail & Related papers (2022-07-15T04:23:42Z) - One-Shot Adaptation of GAN in Just One CLIP [51.188396199083336]
We present a novel single-shot GAN adaptation method through unified CLIP space manipulations.
Specifically, our model employs a two-step training strategy, beginning with reference image search in the source generator using CLIP-guided latent optimization.
We show that our model generates diverse outputs with the target texture and outperforms the baseline models both qualitatively and quantitatively.
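A rough sketch of the CLIP-guided latent optimization step described above: search a generator's latent space for a code whose output lies closest to a reference image in CLIP embedding space. The `generator` and `clip_encode` modules below are random linear stand-ins, and the loop is a generic recipe under those assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
generator = torch.nn.Linear(128, 3 * 32 * 32)    # stand-in for a pretrained GAN generator
clip_encode = torch.nn.Linear(3 * 32 * 32, 256)  # stand-in for a frozen CLIP image encoder
for p in list(generator.parameters()) + list(clip_encode.parameters()):
    p.requires_grad_(False)                      # only the latent code is optimized

def find_reference_latent(ref_image, steps=200, lr=0.05):
    """Optimize a latent code so the generated image matches the reference in CLIP space."""
    target = F.normalize(clip_encode(ref_image.flatten()), dim=-1)
    z = torch.randn(128, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        emb = F.normalize(clip_encode(generator(z)), dim=-1)
        loss = 1.0 - torch.dot(emb, target)      # cosine distance in CLIP space
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

z_star = find_reference_latent(torch.randn(3, 32, 32))
```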
arXiv Detail & Related papers (2022-03-17T13:03:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.