CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure
for Vision-Language Retrieval
- URL: http://arxiv.org/abs/2304.07567v2
- Date: Fri, 18 Aug 2023 02:32:17 GMT
- Title: CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure
for Vision-Language Retrieval
- Authors: Yang Yang, Zhongtian Fu, Xiangyu Wu, Wenjie Li
- Abstract summary: We propose a novel and directly Coordinated VisionLanguage Retrieval method (dubbed CoVLR)
CoVLR aims to study and alleviate the desynchrony problem between the cross-modal alignment and single-modal cluster-preserving tasks.
It can improve single-modal retrieval accuracy whilst preserving cross-modal retrieval capacity compared with the baselines.
- Score: 11.49620599530686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current vision-language retrieval aims to perform cross-modal instance
search, in which the core idea is to learn consistent vision-language
representations. Although the performance of cross-modal retrieval has greatly
improved with the development of deep models, we unfortunately find that
traditional hard consistency may destroy the original relationships among
single-modal instances, leading to performance degradation in single-modal
retrieval. To address this challenge, in this paper, we experimentally observe
that the vision-language divergence may give rise to strong and weak
modalities, and that hard cross-modal consistency cannot guarantee that the
relationships among strong-modality instances are unaffected by the weak
modality; as a result, those relationships are perturbed even though
consistent representations are learned. To this end, we propose a novel
Coordinated Vision-Language Retrieval method (dubbed CoVLR), which aims to
study and alleviate the desynchrony problem between the cross-modal alignment
and single-modal cluster-preserving tasks. CoVLR addresses this challenge with
an effective meta-optimization based strategy, in which the cross-modal
consistency objective and the intra-modal relation-preserving objective act as
the meta-train and meta-test tasks, respectively, so that both tasks are
optimized in a coordinated way. Consequently, we can simultaneously ensure
cross-modal consistency and intra-modal structure. Experiments on different
datasets validate that CoVLR improves single-modal retrieval accuracy whilst
preserving cross-modal retrieval capacity compared with the baselines.
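The abstract describes the meta-optimization only at a high level. As a minimal sketch of how such a meta-train/meta-test coupling could look, assuming PyTorch, a symmetric InfoNCE loss as the cross-modal consistency objective, and a pairwise-similarity-preserving loss as the intra-modal structure objective (the projection weights, loss choices, and learning rates below are illustrative assumptions, not the paper's actual implementation):

```python
import torch
import torch.nn.functional as F

def info_nce(v, t, tau=0.07):
    # Cross-modal consistency: symmetric InfoNCE over a batch of paired embeddings.
    v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
    logits = v @ t.t() / tau
    labels = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def relation_loss(z, z_ref):
    # Intra-modal structure: keep pairwise similarities close to a frozen reference.
    sim = lambda x: F.normalize(x, dim=-1) @ F.normalize(x, dim=-1).t()
    return F.mse_loss(sim(z), sim(z_ref))

def coordinated_step(Wv, Wt, img_feat, txt_feat,
                     inner_lr=1e-2, outer_lr=1e-3, lam=1.0):
    # Meta-train task: cross-modal alignment at the current projection weights.
    loss_align = info_nce(img_feat @ Wv, txt_feat @ Wt)

    # Virtual inner update on the alignment objective; create_graph keeps the
    # dependence on (Wv, Wt) so the meta-test loss can correct this step.
    gWv, gWt = torch.autograd.grad(loss_align, (Wv, Wt), create_graph=True)
    Wv_adapt, Wt_adapt = Wv - inner_lr * gWv, Wt - inner_lr * gWt

    # Meta-test task: intra-modal relation preservation, evaluated at the
    # adapted weights and referenced against the raw single-modal features.
    loss_struct = (relation_loss(img_feat @ Wv_adapt, img_feat)
                   + relation_loss(txt_feat @ Wt_adapt, txt_feat))

    # Coordinated outer update: both objectives shape the final gradient.
    total = loss_align + lam * loss_struct
    gv, gt = torch.autograd.grad(total, (Wv, Wt))
    with torch.no_grad():
        Wv -= outer_lr * gv
        Wt -= outer_lr * gt
    return total.item()

# Illustrative usage with random 512-to-256 projections and 32 paired features.
Wv = torch.randn(512, 256, requires_grad=True)
Wt = torch.randn(512, 256, requires_grad=True)
img_feat, txt_feat = torch.randn(32, 512), torch.randn(32, 512)
print(coordinated_step(Wv, Wt, img_feat, txt_feat))
```

The second-order term introduced by `create_graph=True` is what lets the structure-preserving meta-test loss correct the update direction chosen by the alignment meta-train step, which mirrors the coordination the abstract describes.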
Related papers
- Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation [44.03643049208946]
Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality.
The primary objective is to learn cross-modal matching representations in a latent common space.
The impact of imbalance on retrieval performance remains an open question.
arXiv Detail & Related papers (2024-12-14T09:10:36Z)
- On the Comparison between Multi-modal and Single-modal Contrastive Learning [50.74988548106031]
We introduce a theoretical foundation for understanding the differences between multi-modal and single-modal contrastive learning.
We identify a critical factor, the signal-to-noise ratio (SNR), that impacts the generalizability of both multi-modal and single-modal contrastive learning in downstream tasks.
Our analysis provides a unified framework that can characterize the optimization and generalization of both single-modal and multi-modal contrastive learning.
arXiv Detail & Related papers (2024-11-05T06:21:17Z)
- Leveraging Weak Cross-Modal Guidance for Coherence Modelling via Iterative Learning [66.28872204574648]
Cross-modal coherence modeling is essential for intelligent systems to help them organize and structure information.
Previous work on cross-modal coherence modeling attempted to leverage the order information from another modality to assist coherence recovery in the target modality.
This paper explores a new way to take advantage of cross-modal guidance without gold labels on coherency.
arXiv Detail & Related papers (2024-08-01T06:04:44Z)
- Distributionally Robust Reinforcement Learning with Interactive Data Collection: Fundamental Hardness and Near-Optimal Algorithm [14.517103323409307]
The sim-to-real gap represents the disparity between training and testing environments.
A promising approach to addressing this challenge is distributionally robust RL.
We tackle robust RL via interactive data collection and present an algorithm with a provable sample complexity guarantee.
arXiv Detail & Related papers (2024-04-04T16:40:22Z)
- Masked Contrastive Reconstruction for Cross-modal Medical Image-Report Retrieval [3.5314225883644945]
Cross-modal medical image-report retrieval task plays a significant role in clinical diagnosis and various medical generative tasks.
We propose an efficient framework named Masked Contrastive and Reconstruction (MCR), which takes masked data as the sole input for both tasks.
This enhances task connections, reducing information interference and competition between them, while also substantially decreasing the required GPU memory and training time.
arXiv Detail & Related papers (2023-12-26T01:14:10Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning [23.472951216815765]
Cross-modal representation learning and fine-grained feature discrimination are key to effective video representations.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
arXiv Detail & Related papers (2022-06-21T07:29:37Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two real, practical instance-level retrieval tasks.
We then train a more effective cross-modal model that can adaptively incorporate key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z)
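The ISD metric in the last entry is described only at the level of a single sentence. As a rough, hedged sketch of the underlying idea (not the paper's actual formulation), relation consistency can be read as a divergence between the two modalities' intra-modal self-attention maps, under the strong assumption that regions and words are already in one-to-one correspondence; the actual IAIS method derives this correspondence from inter-modal alignment rather than assuming it.

```python
import torch
import torch.nn.functional as F

def self_attention_map(tokens, tau=1.0):
    # Row-stochastic intra-modal self-attention over one modality's token embeddings.
    sim = F.normalize(tokens, dim=-1) @ F.normalize(tokens, dim=-1).t() / tau
    return F.softmax(sim, dim=-1)

def isd_sketch(region_feats, word_feats):
    # Hypothetical reading of ISD: divergence between the visual and linguistic
    # intra-modal self-attention distributions, assuming region i corresponds
    # to word i (an illustrative simplification, not the paper's alignment).
    A_v = self_attention_map(region_feats)
    A_t = self_attention_map(word_feats)
    return F.kl_div(A_v.log(), A_t, reduction="batchmean")

# Illustrative usage: 20 aligned region/word embeddings of dimension 768.
regions, words = torch.randn(20, 768), torch.randn(20, 768)
print(isd_sketch(regions, words))
```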