Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question
Answering
- URL: http://arxiv.org/abs/2107.06325v1
- Date: Tue, 13 Jul 2021 18:33:04 GMT
- Title: Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question
Answering
- Authors: Rajat Koner, Hang Li, Marcel Hildebrandt, Deepan Das, Volker Tresp,
Stephan Günnemann
- Abstract summary: Graphhopper is a novel method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques.
We derive a scene graph that describes the objects in the image, as well as their attributes and their mutual relationships.
A reinforcement learning agent is trained to autonomously navigate in a multi-hop manner over the extracted scene graph to generate reasoning paths.
- Score: 13.886692497676659
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Question Answering (VQA) is concerned with answering free-form
questions about an image. Since it requires a deep semantic and linguistic
understanding of the question and the ability to associate it with various
objects that are present in the image, it is an ambitious task and requires
multi-modal reasoning from both computer vision and natural language
processing. We propose Graphhopper, a novel method that approaches the task by
integrating knowledge graph reasoning, computer vision, and natural language
processing techniques. Concretely, our method is based on performing
context-driven, sequential reasoning based on the scene entities and their
semantic and spatial relationships. As a first step, we derive a scene graph
that describes the objects in the image, as well as their attributes and their
mutual relationships. Subsequently, a reinforcement learning agent is trained
to autonomously navigate in a multi-hop manner over the extracted scene graph
to generate reasoning paths, which are the basis for deriving answers. We
conduct an experimental study on the challenging dataset GQA, based on both
manually curated and automatically generated scene graphs. Our results show
that we keep up with human performance on manually curated scene graphs.
Moreover, we find that Graphhopper outperforms another state-of-the-art scene
graph reasoning model on both manually curated and automatically generated
scene graphs by a significant margin.
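To make the navigation idea concrete, the following minimal Python sketch walks a toy scene graph hop by hop and reads the answer off the endpoint of the path. The toy graph, the relation names, and the greedy hop selection are illustrative assumptions only; the paper trains a reinforcement learning agent to choose the hops, which is not reproduced here.

```python
# Minimal sketch (assumptions, not the paper's implementation):
# scene graph as adjacency lists, head entity -> list of (relation, tail entity).
scene_graph = {
    "woman": [("holding", "umbrella"), ("wearing", "coat")],
    "umbrella": [("has_attribute", "red")],
    "coat": [("has_attribute", "blue")],
}


def hop(graph, start, relations):
    """Follow a sequence of relations from a start entity; the visited
    entities form the reasoning path whose endpoint serves as the answer."""
    path = [start]
    node = start
    for rel in relations:
        candidates = [tail for r, tail in graph.get(node, []) if r == rel]
        if not candidates:
            return path, None  # no outgoing edge matches: dead end
        node = candidates[0]   # greedy stand-in for the learned policy
        path.append(node)
    return path, node


# "What color is the umbrella the woman is holding?"
# -> start at "woman", hop along ("holding", "has_attribute").
path, answer = hop(scene_graph, "woman", ["holding", "has_attribute"])
print(path, "->", answer)  # ['woman', 'umbrella', 'red'] -> red
```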
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks.
Existing methods struggle to generate scene graphs with novel visual relation concepts.
We introduce a new open-vocabulary SGG framework based on sequence generation.
arXiv Detail & Related papers (2024-04-01T04:21:01Z)
- G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering [61.93058781222079]
We develop a flexible question-answering framework targeting real-world textual graphs.
We introduce the first retrieval-augmented generation (RAG) approach for general textual graphs.
G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem.
arXiv Detail & Related papers (2024-02-12T13:13:04Z)
- SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering [0.0]
Scene graphs have emerged as a useful tool for multimodal image analysis.
Current methods that utilize idealized annotated scene graphs struggle to generalize when using predicted scene graphs extracted from images.
Our approach extracts a scene graph from an input image using a pre-trained scene graph generator.
arXiv Detail & Related papers (2023-10-03T07:14:53Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for Visual Question Answering (VQA) task.
First, it builds a graph for the image and also constructs a graph for the question from both syntactic and embedding information.
Next, we explore the intra-modality relationships with a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question (a minimal sketch of this cross-attention idea appears after this list).
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z)
- Unconditional Scene Graph Generation [72.53624470737712]
We develop a deep auto-regressive model called SceneGraphGen which can learn the probability distribution over labelled and directed graphs.
We show that the scene graphs generated by SceneGraphGen are diverse and follow the semantic patterns of real-world scenes.
arXiv Detail & Related papers (2021-08-12T17:57:16Z)
- A Comprehensive Survey of Scene Graphs: Generation and Application [42.07469181785126]
A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene.
At present, no systematic survey of scene graphs exists.
arXiv Detail & Related papers (2021-03-17T04:24:20Z)
- Understanding the Role of Scene Graphs in Visual Question Answering [26.02889386248289]
We conduct experiments on the GQA dataset which presents a challenging set of questions requiring counting, compositionality and advanced reasoning capability.
We adapt image + question architectures for use with scene graphs, evaluate various scene graph generation techniques for unseen images, and propose a training curriculum to leverage human-annotated and auto-generated scene graphs.
We present a multi-faceted study into the use of scene graphs for Visual Question Answering, making this work the first of its kind.
arXiv Detail & Related papers (2021-01-14T07:27:37Z)
- Scene Graph Reasoning for Visual Question Answering [23.57543808056452]
We propose a novel method that approaches the task by performing context-driven, sequential reasoning based on the objects and their semantic and spatial relationships present in the scene.
A reinforcement learning agent then learns to autonomously navigate over the extracted scene graph to generate paths, which form the basis for deriving answers.
arXiv Detail & Related papers (2020-07-02T13:02:54Z)
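As referenced in the Graph Matching Attention entry above, a bilateral cross-modality attention between an image graph and a question graph can be pictured with a small sketch: node features from each graph attend over the nodes of the other. The function name, feature dimensions, and plain scaled dot-product scoring are hypothetical illustrations assuming both graphs are already encoded into node feature matrices; this is not the authors' implementation.

```python
# Minimal sketch of bilateral cross-graph attention (illustrative assumptions).
import math
import torch


def cross_graph_attention(img_nodes: torch.Tensor, q_nodes: torch.Tensor):
    """img_nodes: (N_i, d) image-graph node features; q_nodes: (N_q, d)."""
    d = img_nodes.size(-1)
    # Affinity between every image node and every question node.
    scores = img_nodes @ q_nodes.t() / math.sqrt(d)        # (N_i, N_q)
    # Image nodes gather question context, and vice versa.
    img_ctx = torch.softmax(scores, dim=-1) @ q_nodes      # (N_i, d)
    q_ctx = torch.softmax(scores.t(), dim=-1) @ img_nodes  # (N_q, d)
    return img_ctx, q_ctx


if __name__ == "__main__":
    img = torch.randn(5, 16)   # e.g. 5 detected objects
    q = torch.randn(7, 16)     # e.g. 7 question tokens/nodes
    img_ctx, q_ctx = cross_graph_attention(img, q)
    print(img_ctx.shape, q_ctx.shape)  # torch.Size([5, 16]) torch.Size([7, 16])
```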
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.