Learning Relation Alignment for Calibrated Cross-modal Retrieval
- URL: http://arxiv.org/abs/2105.13868v2
- Date: Tue, 1 Jun 2021 05:16:22 GMT
- Title: Learning Relation Alignment for Calibrated Cross-modal Retrieval
- Authors: Shuhuai Ren, Junyang Lin, Guangxiang Zhao, Rui Men, An Yang, Jingren
Zhou, Xu Sun, Hongxia Yang
- Abstract summary: We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
- Score: 52.760541762871505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the achievements of large-scale multimodal pre-training approaches,
cross-modal retrieval, e.g., image-text retrieval, remains a challenging task.
To bridge the semantic gap between the two modalities, previous studies mainly
focus on word-region alignment at the object level, lacking the matching
between the linguistic relation among the words and the visual relation among
the regions. The neglect of such relation consistency impairs the
contextualized representation of image-text pairs and hinders the model
performance and the interpretability. In this paper, we first propose a novel
metric, Intra-modal Self-attention Distance (ISD), to quantify the relation
consistency by measuring the semantic distance between linguistic and visual
relations. In response, we present Inter-modal Alignment on Intra-modal
Self-attentions (IAIS), a regularized training method to optimize the ISD and
calibrate intra-modal self-attentions from the two modalities mutually via
inter-modal alignment. The IAIS regularizer boosts the performance of
prevailing models on Flickr30k and MS COCO datasets by a considerable margin,
which demonstrates the superiority of our approach.
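To make the proposed metric concrete, here is a minimal PyTorch-style sketch of an ISD-like term: the visual self-attention is mapped into the text-token space through the text-to-image cross-modal attention and then compared with the linguistic self-attention by a row-wise KL divergence. This is only an illustration under simplifying assumptions (single head and layer, row-stochastic attention matrices); the function name isd_loss and the exact matrix construction are illustrative rather than the paper's precise formulation.

```python
# Illustrative ISD-style relation-consistency term, assuming we can extract
# per-layer attention maps from a vision-language model. The exact construction
# in the IAIS paper (e.g., its singular vs. distributed variants) may differ.
import torch

def isd_loss(lang_self_attn: torch.Tensor,   # (n_txt, n_txt), rows sum to 1
             vis_self_attn: torch.Tensor,    # (n_img, n_img), rows sum to 1
             cross_attn: torch.Tensor,       # (n_txt, n_img), text->image, rows sum to 1
             eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between the language self-attention and the visual
    self-attention mapped into the language token space."""
    # Map visual relations onto text tokens via the cross-modal attention.
    vis_in_lang_space = cross_attn @ vis_self_attn @ cross_attn.transpose(0, 1)
    # Renormalize rows so each one is a proper distribution again.
    vis_in_lang_space = vis_in_lang_space / (vis_in_lang_space.sum(-1, keepdim=True) + eps)
    # Row-wise KL(lang || mapped-vision), averaged over text tokens.
    kl = (lang_self_attn * (torch.log(lang_self_attn + eps)
                            - torch.log(vis_in_lang_space + eps))).sum(-1)
    return kl.mean()

if __name__ == "__main__":
    n_txt, n_img = 12, 36
    lang = torch.softmax(torch.randn(n_txt, n_txt), dim=-1)
    vis = torch.softmax(torch.randn(n_img, n_img), dim=-1)
    cross = torch.softmax(torch.randn(n_txt, n_img), dim=-1)
    # In IAIS-style training this term would be added to the retrieval loss
    # with a small weight, e.g. total = task_loss + lambda * isd.
    print(float(isd_loss(lang, vis, cross)))
```

A symmetric image-side term can be formed in the same way by swapping the roles of the two modalities, so that each modality's self-attention is calibrated by the other.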
Related papers
- Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching [10.709744162565274]
We propose a novel method called DIAS to bridge the modality gap from two aspects.
The method achieves 4.3%-10.2% rSum improvements on Flickr30k and MSCOCO benchmarks.
arXiv Detail & Related papers (2024-10-22T09:37:29Z)
- Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching.
An anchor branch is first trained to provide insights into the data properties.
A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
arXiv Detail & Related papers (2024-04-28T08:44:28Z)
- Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition [3.5803801804085347]
We introduce Recursive Joint Cross-Modal Attention (RJCMA) to capture both intra- and inter-modal relationships across audio, visual, and text modalities for dimensional emotion recognition.
In particular, we compute the attention weights based on cross-correlation between the joint audio-visual-text feature representations and the feature representations of individual modalities.
Extensive experiments are conducted to evaluate the performance of the proposed fusion model on the challenging Affwild2 dataset.
arXiv Detail & Related papers (2024-03-20T15:08:43Z) - Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image.
We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z) - Cross-modal Attention Congruence Regularization for Vision-Language
Relation Alignment [105.70884254216973]
We show that relation alignment can be enforced by encouraging the directed language attention from 'mug' to 'grass' to match the corresponding directed visual attention from the mug region to the grass region.
We prove that this notion of soft relation alignment is equivalent to enforcing congruence between vision and language attention.
We apply our Cross-modal Attention Congruence Regularization (CACR) loss to UNITER and improve on the state-of-the-art approach to Winoground.
arXiv Detail & Related papers (2022-12-20T18:53:14Z) - Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment [63.0407314271459]
Experiments show that the proposed Cross-Align achieves state-of-the-art (SOTA) performance on four out of five language pairs.
arXiv Detail & Related papers (2022-10-09T02:24:35Z) - Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space
Using Joint Cross-Attention [15.643176705932396]
We introduce a joint cross-attentional model for A-V fusion that extracts the salient features across A-V modalities.
It computes the cross-attention weights based on the correlation between the joint feature representation and that of the individual modalities; a minimal sketch of this weighting scheme appears after this list.
Results indicate that our joint cross-attentional A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches.
arXiv Detail & Related papers (2022-09-19T15:01:55Z) - Progressively Guide to Attend: An Iterative Alignment Framework for
Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z) - ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and
Intra-modal Knowledge Integration [48.01536973731182]
We introduce a new vision-and-language pretraining method called ROSITA.
It integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments.
ROSITA significantly outperforms existing state-of-the-art methods on three typical vision-and-language tasks over six benchmark datasets.
arXiv Detail & Related papers (2021-08-16T13:16:58Z)
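The RJCMA and joint cross-attentional A-V fusion entries above (the latter referencing this sketch) both derive attention weights from the correlation between a joint multimodal representation and the features of each individual modality. The sketch below illustrates only that weighting scheme; the module name JointCrossAttentionFusion, the single linear projections, and the tensor shapes are assumptions for illustration, not the papers' exact architecture.

```python
# Minimal sketch of joint cross-attention fusion: attention over time is
# derived from the correlation between a joint (concatenated) representation
# and each modality's own features, then used to re-weight that modality.
import torch
import torch.nn as nn

class JointCrossAttentionFusion(nn.Module):
    def __init__(self, dim_a: int, dim_v: int, dim_joint: int):
        super().__init__()
        # Project the concatenated audio-visual features into a joint space.
        self.joint_proj = nn.Linear(dim_a + dim_v, dim_joint)
        # Per-modality projections used to correlate features with the joint representation.
        self.audio_proj = nn.Linear(dim_a, dim_joint)
        self.visual_proj = nn.Linear(dim_v, dim_joint)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (T, dim_a), visual: (T, dim_v) -- per-timestep features.
        joint = self.joint_proj(torch.cat([audio, visual], dim=-1))  # (T, dim_joint)
        scale = joint.shape[-1] ** 0.5
        # Correlation of each modality with the joint representation -> attention over time.
        attn_a = torch.softmax(self.audio_proj(audio) @ joint.t() / scale, dim=-1)    # (T, T)
        attn_v = torch.softmax(self.visual_proj(visual) @ joint.t() / scale, dim=-1)  # (T, T)
        # Re-weight each modality by its joint-guided attention; residual path
        # keeps the original features so attention only re-weights them.
        audio_att = attn_a @ audio + audio
        visual_att = attn_v @ visual + visual
        return audio_att, visual_att

if __name__ == "__main__":
    fusion = JointCrossAttentionFusion(dim_a=64, dim_v=128, dim_joint=96)
    a, v = torch.randn(10, 64), torch.randn(10, 128)
    a_att, v_att = fusion(a, v)
    print(a_att.shape, v_att.shape)  # torch.Size([10, 64]) torch.Size([10, 128])
```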