SceneGATE: Scene-Graph based co-Attention networks for TExt visual
question answering
- URL: http://arxiv.org/abs/2212.08283v3
- Date: Mon, 7 Aug 2023 08:32:54 GMT
- Title: SceneGATE: Scene-Graph based co-Attention networks for TExt visual
question answering
- Authors: Feiqi Cao, Siwen Luo, Felipe Nunez, Zean Wen, Josiah Poon, Caren Han
- Abstract summary: The paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA.
It reveals the semantic relations among the objects, Optical Character Recognition (OCR) tokens and the question words.
It is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image.
- Score: 2.8974040580489198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most TextVQA approaches focus on the integration of objects, scene texts and
question words by a simple transformer encoder. But this fails to capture the
semantic relations between different modalities. The paper proposes a Scene
Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the
semantic relations among the objects, Optical Character Recognition (OCR)
tokens and the question words. It is achieved by a TextVQA-based scene graph
that discovers the underlying semantics of an image. We created a
guided-attention module to capture the intra-modal interplay between the
language and the vision as guidance for inter-modal interactions. To explicitly
teach the relations between the two modalities, we proposed and integrated two
attention modules, namely a scene graph-based semantic relation-aware attention
and a positional relation-aware attention. We
conducted extensive experiments on two benchmark datasets, Text-VQA and ST-VQA.
The results showed that our SceneGATE method outperformed existing approaches
owing to the scene graph and its attention modules.
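The abstract describes the semantic and positional relation-aware attention modules only at a high level. The sketch below is one minimal way to read them, assuming scene-graph relation labels and discretised spatial relations enter the attention scores as learned additive biases; the class and names (RelationAwareAttention, rel_bias, pos_bias, num_buckets) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, not the authors' code: scene-graph and positional relation
# labels are injected as additive biases on the attention logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareAttention(nn.Module):
    def __init__(self, dim, num_relations, num_buckets=32):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # one learned scalar bias per semantic relation label from the scene graph
        self.rel_bias = nn.Embedding(num_relations, 1)
        # one learned scalar bias per discretised spatial-relation bucket
        self.pos_bias = nn.Embedding(num_buckets, 1)
        self.scale = dim ** -0.5

    def forward(self, x, rel_ids, pos_ids):
        # x: (N, dim) features of objects, OCR tokens and question words
        # rel_ids, pos_ids: (N, N) integer relation labels between token pairs
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = (q @ k.t()) * self.scale
        scores = scores + self.rel_bias(rel_ids).squeeze(-1) \
                        + self.pos_bias(pos_ids).squeeze(-1)
        return F.softmax(scores, dim=-1) @ v
```

Under this reading, the guided-attention module would use intra-modal attention within each modality to guide such inter-modal layers.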
Related papers
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing [66.70054075041487]
Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- Image Semantic Relation Generation [0.76146285961466]
Scene graphs can distil complex image information and correct the bias of visual models using semantic-level relations.
In this work, we introduce image semantic relation generation (ISRG), a simple but effective image-to-text model.
arXiv Detail & Related papers (2022-10-19T16:15:19Z)
- Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval [8.855547063009828]
We propose a Cross-modal Semantic Enhanced Interaction method, termed CMSEI for image-sentence retrieval.
We first design the intra- and inter-modal spatial and semantic graphs based reasoning to enhance the semantic representations of objects.
To correlate the context of objects with the textual context, we further refine the visual semantic representation via the cross-level object-sentence and word-image based interactive attention.
arXiv Detail & Related papers (2022-10-17T10:01:16Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for Visual Question Answering (VQA) task.
It not only builds a graph for the image, but also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z)
- Exploiting Scene Graphs for Human-Object Interaction Detection [81.49184987430333]
Human-Object Interaction (HOI) detection is a fundamental visual task aiming at localizing and recognizing interactions between humans and objects.
We propose a novel method to exploit this information, through the scene graph, for the Human-Object Interaction (SG2HOI) detection task.
Our method, SG2HOI, incorporates the SG information in two ways: (1) we embed a scene graph into a global context clue, serving as the scene-specific environmental context; and (2) we build a relation-aware message-passing module to gather relationships from objects' neighborhood and transfer them into interactions.
arXiv Detail & Related papers (2021-08-19T09:40:50Z)
- Cross-Modal Graph with Meta Concepts for Video Captioning [101.97397967958722]
We propose Cross-Modal Graph (CMG) with meta concepts for video captioning.
To cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions.
We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
arXiv Detail & Related papers (2021-08-14T04:00:42Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information, as in the sketch after this list.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
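To make the gated aggregation idea from the last entry above concrete, here is a minimal sketch of a gated graph convolution over a semantic object graph; the names (GatedGraphConv, edge_index) and the residual update are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal sketch, not the paper's code: each neighbor's message is scaled by a
# learned sigmoid gate before being summed into the target object's feature.
import torch
import torch.nn as nn

class GatedGraphConv(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)   # message computed from a (dst, src) pair
        self.gate = nn.Linear(2 * dim, dim)  # per-dimension gate on that message

    def forward(self, x, edge_index):
        # x: (N, dim) region features; edge_index: (2, E) directed edges src -> dst
        src, dst = edge_index
        pair = torch.cat([x[dst], x[src]], dim=-1)                   # (E, 2*dim)
        messages = torch.sigmoid(self.gate(pair)) * self.msg(pair)   # gated messages
        out = torch.zeros_like(x)
        out.index_add_(0, dst, messages)                             # sum per target node
        return torch.relu(x + out)                                   # residual node update
```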
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content listed here (including all information) and is not responsible for any consequences.