Cross-modal Attention Congruence Regularization for Vision-Language
Relation Alignment
- URL: http://arxiv.org/abs/2212.10549v2
- Date: Tue, 4 Jul 2023 13:18:29 GMT
- Title: Cross-modal Attention Congruence Regularization for Vision-Language
Relation Alignment
- Authors: Rohan Pandey, Rulin Shao, Paul Pu Liang, Ruslan Salakhutdinov,
Louis-Philippe Morency
- Abstract summary: We show that relation alignment can be enforced by encouraging the directed language attention from 'mug' to 'grass' to match the directed visual attention from the mug to the grass.
We prove that this notion of soft relation alignment is equivalent to enforcing congruence between vision and language attention.
We apply our Cross-modal Attention Congruence Regularization (CACR) loss to UNITER and improve on the state-of-the-art approach to Winoground.
- Score: 105.70884254216973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent progress towards scaling up multimodal vision-language models,
these models are still known to struggle on compositional generalization
benchmarks such as Winoground. We find that a critical component lacking from
current vision-language models is relation-level alignment: the ability to
match directional semantic relations in text (e.g., "mug in grass") with
spatial relationships in the image (e.g., the position of the mug relative to
the grass). To tackle this problem, we show that relation alignment can be
enforced by encouraging the directed language attention from 'mug' to 'grass'
(capturing the semantic relation 'in') to match the directed visual attention
from the mug to the grass. Tokens and their corresponding objects are softly
identified using the cross-modal attention. We prove that this notion of soft
relation alignment is equivalent to enforcing congruence between vision and
language attention matrices under a 'change of basis' provided by the
cross-modal attention matrix. Intuitively, our approach projects visual
attention into the language attention space to calculate its divergence from
the actual language attention, and vice versa. We apply our Cross-modal
Attention Congruence Regularization (CACR) loss to UNITER and improve on the
state-of-the-art approach to Winoground.
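In matrix terms, the idea amounts to requiring A_L ≈ C A_V C^T (and symmetrically A_V ≈ C^T A_L C), where A_L and A_V are the language and vision self-attention matrices and the cross-modal attention C acts as the change of basis. The snippet below is a minimal sketch of such a congruence regularizer, assuming row-stochastic attention matrices and a symmetric row-wise KL divergence; the matrix names, normalization, and choice of divergence are illustrative assumptions, not the paper's exact CACR formulation.

```python
import torch
import torch.nn.functional as F


def cacr_style_loss(A_l: torch.Tensor, A_v: torch.Tensor, C: torch.Tensor,
                    eps: float = 1e-8) -> torch.Tensor:
    """Congruence-style regularizer between language and vision self-attention.

    A_l: (T, T) language self-attention (one head/layer, rows sum to ~1)
    A_v: (V, V) vision self-attention over image regions
    C:   (T, V) cross-modal attention from language tokens to visual regions
    The names and the symmetric row-wise KL below are illustrative assumptions,
    not necessarily the paper's exact CACR formulation.
    """
    # Project visual attention into the language space: C A_v C^T -> (T, T)
    A_v_in_l = C @ A_v @ C.transpose(-1, -2)
    # Project language attention into the visual space: C^T A_l C -> (V, V)
    A_l_in_v = C.transpose(-1, -2) @ A_l @ C

    def row_kl(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # Row-normalize both matrices and take KL(p || q) averaged over rows.
        p = (p + eps) / (p + eps).sum(-1, keepdim=True)
        q = (q + eps) / (q + eps).sum(-1, keepdim=True)
        return F.kl_div(q.log(), p, reduction="batchmean")

    # Congruence in both directions: language vs. projected vision, and vice versa.
    return row_kl(A_l, A_v_in_l) + row_kl(A_v, A_l_in_v)


# Example with random attention-like matrices (T=5 tokens, V=7 regions):
T, V = 5, 7
A_l = torch.softmax(torch.randn(T, T), dim=-1)
A_v = torch.softmax(torch.randn(V, V), dim=-1)
C = torch.softmax(torch.randn(T, V), dim=-1)
print(cacr_style_loss(A_l, A_v, C))
```

In practice a term like this would be added with a small weight to the model's existing training objective (e.g., alongside UNITER's pre-training or fine-tuning losses), so that relation-level congruence is encouraged without replacing the base objective.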
Related papers
- Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching [7.7559623054251]
Image-text matching (ITM) is a fundamental problem in computer vision.
We propose a Hybrid-modal Interaction with multiple Relational Enhancements (termed Hire) for image-text matching.
In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation of visual objects.
arXiv Detail & Related papers (2024-06-05T13:10:55Z)
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
- SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering [2.8974040580489198]
The paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA.
It reveals the semantic relations among the objects, Optical Character Recognition (OCR) tokens and the question words.
It is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image.
arXiv Detail & Related papers (2022-12-16T05:10:09Z)
- Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval [8.855547063009828]
We propose a Cross-modal Semantic Enhanced Interaction method, termed CMSEI for image-sentence retrieval.
We first design the intra- and inter-modal spatial and semantic graphs based reasoning to enhance the semantic representations of objects.
To correlate the context of objects with the textual context, we further refine the visual semantic representation via the cross-level object-sentence and word-image based interactive attention.
arXiv Detail & Related papers (2022-10-17T10:01:16Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
arXiv Detail & Related papers (2021-09-14T02:08:23Z)
- Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z)
- Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos [96.85840365678649]
We tackle the problem of referring expression comprehension in videos with an elegant one-stage framework.
We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency.
Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset.
arXiv Detail & Related papers (2021-03-23T06:42:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.