Question-Driven Graph Fusion Network For Visual Question Answering
- URL: http://arxiv.org/abs/2204.00975v1
- Date: Sun, 3 Apr 2022 03:02:03 GMT
- Title: Question-Driven Graph Fusion Network For Visual Question Answering
- Authors: Yuxi Qian, Yuncong Hu, Ruonan Wang, Fangxiang Feng and Xiaojie Wang
- Abstract summary: We propose a Question-Driven Graph Fusion Network (QD-GFN).
It first models semantic, spatial, and implicit visual relations in images with three graph attention networks; question information then guides the aggregation of the three graphs.
Experimental results demonstrate that our QD-GFN outperforms the prior state of the art on both the VQA 2.0 and VQA-CP v2 datasets.
- Score: 15.098694655795168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Visual Question Answering (VQA) models have explored various visual
relationships between objects in the image to answer complex questions, which
inevitably introduces irrelevant information from inaccurate object detection
and text grounding. To address this problem, we propose a Question-Driven Graph
Fusion Network (QD-GFN). It first models semantic, spatial, and implicit visual
relations in images with three graph attention networks; question information
then guides the aggregation of the three graphs. In addition, QD-GFN adopts an
object filtering mechanism to remove question-irrelevant objects from the image.
Experimental results demonstrate that our QD-GFN outperforms the prior
state of the art on both the VQA 2.0 and VQA-CP v2 datasets. Further analysis shows
that both the novel graph aggregation method and the object filtering mechanism
play a significant role in improving the performance of the model.
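To make the pipeline concrete, below is a minimal PyTorch sketch of the two mechanisms the abstract describes: three graph-attention views fused with question-conditioned weights, followed by a question-guided object filter. The single-head attention, softmax gate, top-k filter, and the module names `GraphAttentionLayer` and `QuestionDrivenFusion` are our illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GraphAttentionLayer(nn.Module):
    """Single-head graph attention over object features (one relation view)."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, nodes, adj):
        # nodes: (N, D) object features; adj: (N, N) bool relation mask
        h = self.proj(nodes)
        n = h.size(0)
        pairs = torch.cat(
            [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)],
            dim=-1,
        )                                              # (N, N, 2D) node pairs
        scores = self.attn(pairs).squeeze(-1)          # (N, N) edge scores
        scores = scores.masked_fill(~adj, float("-inf"))
        return torch.softmax(scores, dim=-1) @ h       # attention-weighted sum


class QuestionDrivenFusion(nn.Module):
    """Fuse semantic/spatial/implicit graph views with question-conditioned
    weights, then keep only the objects most relevant to the question."""

    def __init__(self, dim, keep_k=10):
        super().__init__()
        self.views = nn.ModuleList(GraphAttentionLayer(dim) for _ in range(3))
        self.gate = nn.Linear(dim, 3)       # one fusion weight per relation view
        self.relevance = nn.Linear(dim, 1)  # question-object relevance score
        self.keep_k = keep_k

    def forward(self, nodes, adjs, q):
        # nodes: (N, D); adjs: three (N, N) masks; q: (D,) question vector
        outs = torch.stack(
            [view(nodes, a) for view, a in zip(self.views, adjs)]
        )                                               # (3, N, D)
        w = torch.softmax(self.gate(q), dim=-1)         # (3,) question-driven
        fused = (w.view(3, 1, 1) * outs).sum(dim=0)     # (N, D) fused graph
        # Object filtering: drop question-irrelevant objects via top-k scores.
        rel = self.relevance(fused * q).squeeze(-1)     # (N,)
        keep = rel.topk(min(self.keep_k, rel.numel())).indices
        return fused[keep]


# Toy usage: 6 objects, 128-d features, three random relation masks
# (self-loops added so every row attends to at least one node).
nodes = torch.randn(6, 128)
eye = torch.eye(6, dtype=torch.bool)
adjs = [(torch.rand(6, 6) > 0.5) | eye for _ in range(3)]
q = torch.randn(128)
out = QuestionDrivenFusion(128, keep_k=4)(nodes, adjs, q)
print(out.shape)  # torch.Size([4, 128])
```

The softmax gate is one simple way to let the question reweight the three relation views; the paper's actual aggregation and filtering schemes may differ.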
Related papers
- Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z)
- Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering [16.502197578954917]
Graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features.
We propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA.
arXiv Detail & Related papers (2023-07-25T04:41:32Z)
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, distinguishing the positive query from other highly similar queries that are not the best match poses a challenge for the network.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- Joint learning of object graph and relation graph for visual question answering [19.97265717398179]
We introduce a novel Dual Message-passing enhanced Graph Neural Network (DM-GNN).
DM-GNN can obtain a balanced representation by properly encoding multi-scale scene graph information.
We conduct extensive experiments on datasets including GQA, VG, motif-VG and achieve new state of the art.
arXiv Detail & Related papers (2022-05-09T11:08:43Z)
- Question-Answer Sentence Graph for Joint Modeling Answer Selection [122.29142965960138]
We train and integrate state-of-the-art (SOTA) models for computing scores between question-question, question-answer, and answer-answer pairs.
Online inference is then performed to solve the answer sentence selection (AS2) task on unseen queries.
arXiv Detail & Related papers (2022-02-16T05:59:53Z)
- Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for Visual Question Answering (VQA) task.
It not only builds a graph for the image, but also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z)
- Visual Relationship Forecasting in Videos [56.122037294234865]
We present a new task named Visual Relationship Forecasting (VRF) in videos to explore the prediction of visual relationships in a manner of reasoning.
Given a subject-object pair with H existing frames, VRF aims to predict their future interactions for the next T frames without visual evidence.
To evaluate the VRF task, we introduce two video datasets, VRF-AG and VRF-VidOR, with a series of temporally localized visual relation annotations in each video.
arXiv Detail & Related papers (2021-07-02T16:43:19Z)
- Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering [27.042604046441426]
Knowledge-based Visual Question Answering (KVQA) requires external knowledge beyond the visible content to answer questions about an image.
In this paper, we depict an image by multiple knowledge graphs from the visual, semantic and factual views.
We decompose the model into a series of memory-based reasoning steps, each performed by a Graph-based Read, Update, and Control (GRUC) module.
We achieve a new state-of-the-art performance on three popular benchmark datasets, including FVQA, Visual7W-KB and OK-VQA.
arXiv Detail & Related papers (2020-08-31T23:25:01Z)
- GPS-Net: Graph Property Sensing Network for Scene Graph Generation [91.60326359082408]
Scene graph generation (SGG) aims to detect objects in an image along with their pairwise relationships.
GPS-Net fully explores three properties for SGG: edge direction information, the difference in priority between nodes, and the long-tailed distribution of relationships.
GPS-Net achieves state-of-the-art performance on three popular databases (VG, OI, and VRD) with significant gains under various settings and metrics.
arXiv Detail & Related papers (2020-03-29T07:22:31Z)