Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching
- URL: http://arxiv.org/abs/2410.16853v1
- Date: Tue, 22 Oct 2024 09:37:29 GMT
- Title: Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching
- Authors: Xiang Ma, Xuemei Li, Lexin Fang, Caiming Zhang
- Abstract summary: We propose a novel method called DIAS to bridge the modality gap from two aspects.
The method achieves 4.3%-10.2% rSum improvements on Flickr30k and MSCOCO benchmarks.
- Score: 10.709744162565274
- Abstract: Many contrastive learning based models have achieved advanced performance in image-text matching tasks. The key to these models lies in analyzing the correlation between image-text pairs, which involves cross-modal interaction of embeddings in corresponding dimensions. However, the embeddings of different modalities come from different models or modules, and there is a significant modality gap between them. Directly interacting such embeddings lacks rationality and may capture inaccurate correlations. Therefore, we propose a novel method called DIAS to bridge the modality gap from two aspects: (1) We align the information representation of embeddings from different modalities in corresponding dimensions to ensure the correlation calculation is based on interactions of similar information. (2) Spatial constraints on inter- and intra-modality unmatched pairs are introduced to ensure the effectiveness of the model's semantic alignment. Besides, a sparse correlation algorithm is proposed to select strongly correlated spatial relationships, enabling the model to learn more significant features and avoid being misled by weak correlations. Extensive experiments demonstrate the superiority of DIAS, achieving 4.3%-10.2% rSum improvements on the Flickr30k and MSCOCO benchmarks.
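The abstract's sparse correlation algorithm keeps only strongly correlated spatial relationships. A minimal sketch of one way such selection could work — retaining the top-k entries per row of a region-word similarity matrix and zeroing out the weak ones — is given below. The function name, the per-row top-k rule, and the toy matrix are illustrative assumptions, not DIAS's actual algorithm.

```python
import numpy as np

def sparse_topk_correlation(sim, k):
    """Keep the k strongest correlations per row of a cross-modal
    similarity matrix and zero out the rest (hypothetical sketch)."""
    out = np.zeros_like(sim)
    idx = np.argsort(sim, axis=1)[:, -k:]        # indices of the k largest per row
    rows = np.arange(sim.shape[0])[:, None]
    out[rows, idx] = sim[rows, idx]
    return out

# Toy 3x4 region-word similarity matrix: 3 image regions, 4 words.
sim = np.array([[0.9, 0.1, 0.4, 0.2],
                [0.3, 0.8, 0.2, 0.7],
                [0.5, 0.5, 0.6, 0.1]])
sparse = sparse_topk_correlation(sim, k=2)
```

After selection, each region interacts with only its two most correlated words, so weakly correlated pairs cannot dilute the matching signal.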
Related papers
- Towards Deconfounded Image-Text Matching with Causal Inference [36.739004282369656]
We propose an innovative Deconfounded Causal Inference Network (DCIN) for image-text matching task.
DCIN decomposes the intra- and inter-modal confounders and incorporates them into the encoding stage of visual and textual features.
It can learn causality instead of spurious correlations caused by dataset bias.
arXiv Detail & Related papers (2024-08-22T11:04:28Z) - A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap [50.079224604394]
We present a novel model-agnostic framework called Context-Enhanced Feature Alignment (CEFA).
CEFA consists of a feature alignment module and a context enhancement module.
Our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories.
arXiv Detail & Related papers (2024-07-31T08:42:48Z) - Multi-scale Target-Aware Framework for Constrained Image Splicing Detection and Localization [11.803255600587308]
We propose a multi-scale target-aware framework to couple feature extraction and correlation matching in a unified pipeline.
Our approach can effectively promote the collaborative learning of related patches, and perform mutual promotion of feature learning and correlation matching.
Our experiments demonstrate that our model, which uses a unified pipeline, outperforms state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2023-08-18T07:38:30Z) - Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection [111.04994415248736]
We propose a Discriminative co-saliency and background Mining Transformer framework (DMT).
We use two types of pre-defined tokens to mine co-saliency and background information via our proposed contrast-induced pixel-to-token correlation and co-saliency token-to-token correlation modules.
Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-04-30T15:56:47Z) - FECANet: Boosting Few-Shot Semantic Segmentation with Feature-Enhanced Context-Aware Network [48.912196729711624]
Few-shot semantic segmentation is the task of learning to locate each pixel of a novel class in a query image with only a few annotated support images.
We propose a Feature-Enhanced Context-Aware Network (FECANet) to suppress the matching noise caused by inter-class local similarity.
In addition, we propose a novel correlation reconstruction module that encodes extra correspondence relations between foreground and background and multi-scale context semantic features.
arXiv Detail & Related papers (2023-01-19T16:31:13Z) - Deep Relational Metric Learning [84.95793654872399]
This paper presents a deep relational metric learning framework for image clustering and retrieval.
We learn an ensemble of features that characterizes an image from different aspects to model both interclass and intraclass distributions.
Experiments on the widely-used CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate that our framework improves existing deep metric learning methods and achieves very competitive results.
arXiv Detail & Related papers (2021-08-23T09:31:18Z) - Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
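As a rough illustration of how an intra-modal self-attention distance could be computed, the sketch below derives a single-head self-attention map for each modality and takes the mean absolute difference between the two maps. This is a simplified stand-in for the paper's ISD metric, both function names are hypothetical, and it assumes the two modalities yield the same number of tokens/regions.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention weights for a set of embeddings (sketch)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)       # rows sum to 1

def intra_modal_self_attention_distance(vis, txt):
    """Mean absolute difference between the two modalities' attention maps,
    used here as a stand-in for the paper's ISD metric."""
    return float(np.abs(self_attention(vis) - self_attention(txt)).mean())
```

Identical relational structure in the two modalities yields a distance of zero; the more the visual and linguistic relations disagree, the larger the value, which is what a regularizer like IAIS would push down.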
arXiv Detail & Related papers (2021-05-28T14:25:49Z) - COBRA: Contrastive Bi-Modal Representation Algorithm [43.33840912256077]
We present a novel framework that aims to train two modalities in a joint fashion inspired by Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE) paradigms.
We empirically show that this framework reduces the modality gap significantly and generates a robust and task agnostic joint-embedding space.
We outperform existing work on four diverse downstream tasks spanning across seven benchmark cross-modal datasets.
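COBRA's training objective builds on CPC/NCE-style contrastive learning. The sketch below shows a generic InfoNCE loss over a batch of paired embeddings, where matched pairs form the diagonal of a similarity matrix and all other pairings act as negatives; this is the standard formulation, not COBRA's exact objective, and the temperature value is an assumption.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Generic InfoNCE loss: matched (z_a[i], z_b[i]) pairs are positives,
    all other in-batch pairings are negatives. Not COBRA's exact loss."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)  # L2-normalize
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature        # pairwise cosine similarities
    m = logits.max(axis=1, keepdims=True)     # stable log-softmax
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss pulls each matched pair together and pushes mismatched pairs apart, which is the mechanism that shrinks the modality gap in the joint embedding space.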
arXiv Detail & Related papers (2020-05-07T18:20:12Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.