Cross-Modality Relevance for Reasoning on Language and Vision
- URL: http://arxiv.org/abs/2005.06035v1
- Date: Tue, 12 May 2020 20:17:25 GMT
- Title: Cross-Modality Relevance for Reasoning on Language and Vision
- Authors: Chen Zheng, Quan Guo, Parisa Kordjamshidi
- Abstract summary: This work deals with the challenge of learning and reasoning over language and vision data for the related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR).
We design a novel cross-modality relevance module that is used in an end-to-end framework to learn the relevance representation between components of various input modalities under the supervision of a target task.
Our proposed approach shows competitive performance on two different language and vision tasks using public benchmarks and improves the state-of-the-art published results.
- Score: 22.41781462637622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work deals with the challenge of learning and reasoning over language
and vision data for the related downstream tasks such as visual question
answering (VQA) and natural language for visual reasoning (NLVR). We design a
novel cross-modality relevance module that is used in an end-to-end framework
to learn the relevance representation between components of various input
modalities under the supervision of a target task, which is more generalizable
to unobserved data compared to merely reshaping the original representation
space. In addition to modeling the relevance between the textual entities and
visual entities, we model the higher-order relevance between entity relations
in the text and object relations in the image. Our proposed approach shows
competitive performance on two different language and vision tasks using public
benchmarks and improves the state-of-the-art published results. The alignments
of input spaces and their relevance representations learned on the NLVR task
boost the training efficiency of the VQA task.
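The abstract above describes the core idea only at a high level: compute a relevance map between textual and visual components and let it be supervised purely by the downstream task loss. The PyTorch snippet below is a minimal, hypothetical sketch of that idea; the class name, projection layers, and pooling are illustrative assumptions, not the authors' published architecture (which additionally models relation-level relevance).

```python
import torch
import torch.nn as nn

class CrossModalityRelevanceSketch(nn.Module):
    """Illustrative only: bilinear relevance between textual and visual
    entity representations, trained end-to-end under a task loss."""

    def __init__(self, text_dim: int, vision_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)

    def forward(self, text_feats: torch.Tensor, vision_feats: torch.Tensor):
        # text_feats:   (batch, num_tokens,  text_dim)
        # vision_feats: (batch, num_objects, vision_dim)
        t = self.text_proj(text_feats)                   # (batch, tokens,  hidden)
        v = self.vision_proj(vision_feats)               # (batch, objects, hidden)
        relevance = torch.matmul(t, v.transpose(1, 2))   # pairwise token-object scores
        pooled = relevance.softmax(dim=-1).matmul(v)     # relevance-weighted visual context
        return relevance, pooled.mean(dim=1)             # map + summary vector for a task head

# Usage sketch: the summary vector would feed a VQA/NLVR classifier head, so the
# relevance map is learned only from the downstream task supervision.
if __name__ == "__main__":
    module = CrossModalityRelevanceSketch(text_dim=768, vision_dim=2048)
    rel_map, summary = module(torch.randn(2, 12, 768), torch.randn(2, 36, 2048))
    print(rel_map.shape, summary.shape)   # torch.Size([2, 12, 36]) torch.Size([2, 256])
```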
Related papers
- Why context matters in VQA and Reasoning: Semantic interventions for VLM input modalities [18.859309032300402]
We investigate how the integration of information from image and text modalities influences the performance and behavior of Visual Language Model (VLM) predictions.
We study the interplay between text and image modalities in different configurations where visual content is essential for solving the VQA task.
Our results show that complementary information between modalities improves answer and reasoning quality, while contradictory information harms model performance and confidence.
arXiv Detail & Related papers (2024-10-02T16:02:02Z)
- Towards Flexible Visual Relationship Segmentation [25.890273232954055]
Visual relationship understanding has been studied separately in human-object interaction detection, scene graph generation, and referring relationships tasks.
We propose FleVRS, a single model that seamlessly integrates the above three aspects in standard and promptable visual relationship segmentation.
Our framework outperforms existing models in standard, promptable, and open-vocabulary tasks.
arXiv Detail & Related papers (2024-08-15T17:57:38Z) - Visual In-Context Learning for Large Vision-Language Models [62.5507897575317]
In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities.
We introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition.
Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarizes images with task intent and task-specific visual parsing, and composes language-based demonstrations.
arXiv Detail & Related papers (2024-02-18T12:43:38Z) - Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval [8.855547063009828]
- Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval [8.855547063009828]
We propose a Cross-modal Semantic Enhanced Interaction method, termed CMSEI, for image-sentence retrieval.
We first design intra- and inter-modal spatial and semantic graph-based reasoning to enhance the semantic representations of objects.
To correlate the context of objects with the textual context, we further refine the visual semantic representation via the cross-level object-sentence and word-image based interactive attention.
arXiv Detail & Related papers (2022-10-17T10:01:16Z)
- Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models [2.1320960069210484]
This work studies multimodal learning in the context of visually grounded speech (VGS) models.
We introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words.
We show that cross-modal attention helps the model to achieve higher semantic cross-modal retrieval performance.
arXiv Detail & Related papers (2021-07-05T12:54:05Z)
- Learning Relation Alignment for Calibrated Cross-modal Retrieval [52.760541762871505]
We propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations.
We present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions mutually via inter-modal alignment.
arXiv Detail & Related papers (2021-05-28T14:25:49Z)
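To make the ISD idea above concrete, the following numpy sketch treats ISD as a matrix distance between a textual self-attention map and a visual self-attention map projected into token space through a hard cross-modal alignment. This is an assumption-laden simplification; the paper's exact formulation may differ, and every name here is hypothetical.

```python
import numpy as np

def induced_text_attention(vis_self_attn, cross_attn_t2v):
    """Project visual self-attention into text-token index space using a hard
    alignment (argmax of text-to-vision cross-attention). Illustrative only."""
    align = cross_attn_t2v.argmax(axis=1)        # best visual region per token
    return vis_self_attn[np.ix_(align, align)]   # (num_tokens, num_tokens)

def intra_modal_self_attention_distance(text_self_attn, vis_self_attn, cross_attn_t2v):
    """Hypothetical ISD: Frobenius distance between textual self-attention and
    its visually induced counterpart."""
    induced = induced_text_attention(vis_self_attn, cross_attn_t2v)
    return np.linalg.norm(text_self_attn - induced)

# Usage with random attention maps: 12 text tokens, 36 visual regions.
rng = np.random.default_rng(0)
print(intra_modal_self_attention_distance(rng.random((12, 12)),
                                          rng.random((36, 36)),
                                          rng.random((12, 36))))
```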
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z)
- ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning [97.10875695679499]
We propose a novel contrastive learning framework named ERICA in the pre-training phase to obtain a deeper understanding of the entities and their relations in text.
Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks.
arXiv Detail & Related papers (2020-12-30T03:35:22Z)
- Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions.
We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision.
Experiments show that by taking advantage of these relationships we are able to improve over the state of the art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z)
- Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks [9.462808515258464]
We propose an attention mechanism - Linguistically-aware Attention (LAT) - that leverages object attributes obtained from generic object detectors.
LAT represents visual and textual modalities in a common linguistically-rich space, thus providing linguistic awareness to the attention process.
We apply and demonstrate the effectiveness of LAT in three Vision-language (V-L) tasks: Counting-VQA, VQA, and Image captioning.
arXiv Detail & Related papers (2020-08-18T16:29:49Z)
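As a loose illustration of the LAT summary above, the numpy sketch below scores each detected object by how closely its label's word embedding matches the question words, all in one shared word-embedding space, and then pools visual features by those scores. The scoring and pooling choices are assumptions, not the paper's formulation.

```python
import numpy as np

def linguistically_aware_attention(question_word_embs, object_label_embs, object_feats):
    """Hypothetical sketch: attend over detected objects using word-embedding
    similarity between question words and detected object labels."""
    affinity = question_word_embs @ object_label_embs.T   # (num_words, num_objects)
    scores = affinity.max(axis=0)                         # best-matching word per object
    weights = np.exp(scores) / np.exp(scores).sum()       # softmax over objects
    return weights @ object_feats                         # attended visual feature

# Usage with random stand-ins: 8 question words, 36 detected objects.
rng = np.random.default_rng(0)
attended = linguistically_aware_attention(rng.normal(size=(8, 300)),
                                          rng.normal(size=(36, 300)),
                                          rng.normal(size=(36, 2048)))
print(attended.shape)   # (2048,)
```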
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
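The ORG encoder above is described as capturing interaction features among detected objects. The PyTorch sketch below shows one generic way such relational encoding is commonly realized (a learned similarity adjacency followed by graph-convolution-style message passing); it is an assumption-based illustration, not the paper's architecture or its TRL training strategy.

```python
import torch
import torch.nn as nn

class ObjectRelationalGraphSketch(nn.Module):
    """Illustrative one-layer relational encoder over detected-object features."""

    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.query = nn.Linear(feat_dim, hidden_dim)
        self.key = nn.Linear(feat_dim, hidden_dim)
        self.update = nn.Linear(feat_dim, hidden_dim)

    def forward(self, object_feats: torch.Tensor) -> torch.Tensor:
        # object_feats: (batch, num_objects, feat_dim)
        adj = torch.softmax(
            self.query(object_feats) @ self.key(object_feats).transpose(1, 2),
            dim=-1)                                          # soft relation graph among objects
        return torch.relu(adj @ self.update(object_feats))   # relation-enriched object features

# Usage: 36 detected objects with 2048-d features per frame.
encoder = ObjectRelationalGraphSketch()
print(encoder(torch.randn(2, 36, 2048)).shape)   # torch.Size([2, 36, 512])
```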