SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based
Question Answering
- URL: http://arxiv.org/abs/2310.01842v1
- Date: Tue, 3 Oct 2023 07:14:53 GMT
- Title: SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based
Question Answering
- Authors: Bruno Souza and Marius Aasan and Helio Pedrini and Adín Ramírez Rivera
- Abstract summary: Scene graphs have emerged as a useful tool for multimodal image analysis.
Current methods that utilize idealized annotated scene graphs struggle to generalize when using predicted scene graphs extracted from images.
Our approach extracts a scene graph from an input image using a pre-trained scene graph generator.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The intersection of vision and language is of major interest due to the
increased focus on seamless integration between recognition and reasoning.
Scene graphs (SGs) have emerged as a useful tool for multimodal image analysis,
showing impressive performance in tasks such as Visual Question Answering
(VQA). In this work, we demonstrate that despite the effectiveness of scene
graphs in VQA tasks, current methods that utilize idealized annotated scene
graphs struggle to generalize when using predicted scene graphs extracted from
images. To address this issue, we introduce the SelfGraphVQA framework. Our
approach extracts a scene graph from an input image using a pre-trained scene
graph generator and employs semantically-preserving augmentation with
self-supervised techniques. This method improves the utilization of graph
representations in VQA tasks by circumventing the need for costly and
potentially biased annotated data. By creating alternative views of the
extracted graphs through image augmentations, we can learn joint embeddings by
optimizing the informational content in their representations using an
un-normalized contrastive approach. As we work with SGs, we experiment with
three distinct maximization strategies: node-wise, graph-wise, and
permutation-equivariant regularization. We empirically showcase the
effectiveness of the extracted scene graph for VQA and demonstrate that these
approaches enhance overall performance by highlighting the significance of
visual information. This offers a more practical solution for VQA tasks that
rely on SGs for complex reasoning questions.
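The three maximization strategies named above (node-wise, graph-wise, and permutation-equivariant regularization) can be sketched in miniature. The following is a minimal NumPy illustration under the assumption of plain dot-product similarity between node embeddings of two augmented views of the same scene graph; the function names and the exact scoring choices are hypothetical, not the paper's implementation.

```python
import numpy as np

def node_wise_score(z1, z2):
    """Mean dot-product agreement between corresponding node embeddings
    of two views (shape: [num_nodes, dim] each). Higher is better."""
    return float(np.mean(np.sum(z1 * z2, axis=1)))

def graph_wise_score(z1, z2):
    """Agreement between pooled (mean) graph-level embeddings of two views."""
    g1, g2 = z1.mean(axis=0), z2.mean(axis=0)
    return float(np.dot(g1, g2))

def permutation_equivariance_penalty(z1, z2, perm):
    """Encourage equivariance: if view 2 is view 1 with its nodes permuted
    by `perm`, the encoder's outputs should match after the same permutation.
    Returns a mean-squared mismatch (lower is better)."""
    return float(np.mean((z1[perm] - z2) ** 2))
```

A training objective in this spirit would maximize the two agreement scores (an un-normalized contrastive term, with no softmax over negatives) while minimizing the equivariance penalty; the relative weighting of the three terms is a design choice not specified here.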
Related papers
- Instance-Aware Graph Prompt Learning [71.26108600288308]
We introduce Instance-Aware Graph Prompt Learning (IA-GPL) in this paper.
The process involves generating intermediate prompts for each instance using a lightweight architecture.
Experiments conducted on multiple datasets and settings showcase the superior performance of IA-GPL compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-26T18:38:38Z)
- From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks.
Existing methods struggle to generate scene graphs with novel visual relation concepts.
We introduce a new open-vocabulary SGG framework based on sequence generation.
arXiv Detail & Related papers (2024-04-01T04:21:01Z)
- G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering [61.93058781222079]
We develop a flexible question-answering framework targeting real-world textual graphs.
We introduce the first retrieval-augmented generation (RAG) approach for general textual graphs.
G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem.
arXiv Detail & Related papers (2024-02-12T13:13:04Z)
- Fine-Grained is Too Coarse: A Novel Data-Centric Approach for Efficient Scene Graph Generation [0.7851536646859476]
We introduce the task of Efficient Scene Graph Generation (SGG) that prioritizes the generation of relevant relations.
We present a new dataset, VG150-curated, based on the annotations of the popular Visual Genome dataset.
We show through a set of experiments that this dataset contains more high-quality and diverse annotations than the one usually used in SGG.
arXiv Detail & Related papers (2023-05-30T00:55:49Z)
- Scene Graph Modification as Incremental Structure Expanding [61.84291817776118]
We focus on scene graph modification (SGM), where the system is required to learn how to update an existing scene graph based on a natural language query.
We frame SGM as a graph expansion task by introducing an incremental structure expanding (ISE) approach.
We construct a challenging dataset that contains more complicated queries and larger scene graphs than existing datasets.
arXiv Detail & Related papers (2022-09-15T16:26:14Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress on the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for Visual Question Answering (VQA) task.
It not only builds a graph for the image, but also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z)
- Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering [13.886692497676659]
Graphhopper is a novel method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques.
We derive a scene graph that describes the objects in the image, as well as their attributes and their mutual relationships.
A reinforcement learning agent is trained to autonomously navigate in a multi-hop manner over the extracted scene graph to generate reasoning paths.
arXiv Detail & Related papers (2021-07-13T18:33:04Z)
- A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval [4.159666152160874]
Scene graph representation is a suitable method for the image-text matching challenge.
We introduce the Local and Global Scene Graph Matching (LGSGM) model that enhances the state-of-the-art method.
Our enhancement with the combination of levels can improve the performance of the baseline method by increasing the recall by more than 10% on the Flickr30k dataset.
arXiv Detail & Related papers (2021-06-04T10:33:14Z)
- Understanding the Role of Scene Graphs in Visual Question Answering [26.02889386248289]
We conduct experiments on the GQA dataset which presents a challenging set of questions requiring counting, compositionality and advanced reasoning capability.
We adapt image + question architectures for use with scene graphs, evaluate various scene graph generation techniques for unseen images, and propose a training curriculum to leverage both human-annotated and auto-generated scene graphs.
We present a multi-faceted study into the use of scene graphs for Visual Question Answering, making this work the first of its kind.
arXiv Detail & Related papers (2021-01-14T07:27:37Z)