Question-Driven Graph Fusion Network For Visual Question Answering
- URL: http://arxiv.org/abs/2204.00975v1
- Date: Sun, 3 Apr 2022 03:02:03 GMT
- Title: Question-Driven Graph Fusion Network For Visual Question Answering
- Authors: Yuxi Qian, Yuncong Hu, Ruonan Wang, Fangxiang Feng and Xiaojie Wang
- Abstract summary: We propose a Question-Driven Graph Fusion Network (QD-GFN)
It first models semantic, spatial, and implicit visual relations in images with three graph attention networks; question information then guides the aggregation of the three graphs.
Experiment results demonstrate that our QD-GFN outperforms the prior state-of-the-art on both VQA 2.0 and VQA-CP v2 datasets.
- Score: 15.098694655795168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Visual Question Answering (VQA) models have explored various visual
relationships between objects in the image to answer complex questions, which
inevitably introduces irrelevant information brought by inaccurate object
detection and text grounding. To address this problem, we propose a
Question-Driven Graph Fusion Network (QD-GFN). It first models semantic,
spatial, and implicit visual relations in images with three graph attention
networks; question information then guides the aggregation of the three
graphs. In addition, QD-GFN adopts an object-filtering mechanism to remove
question-irrelevant objects from the image.
Experiment results demonstrate that our QD-GFN outperforms the prior
state-of-the-art on both VQA 2.0 and VQA-CP v2 datasets. Further analysis shows
that both the novel graph aggregation method and object filtering mechanism
play a significant role in improving the performance of the model.
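The high-level idea of the abstract — aggregate several relation graphs with attention, gate them by the question, then filter question-irrelevant objects — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the attention, gating, and filtering rules below (`graph_attention`, `question_guided_fusion`, the mean-relevance threshold) are simplified assumptions standing in for the learned graph attention networks described in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(obj_feats, adj):
    """Toy attention aggregation over one relation graph."""
    scores = obj_feats @ obj_feats.T          # pairwise affinities
    scores = np.where(adj > 0, scores, -1e9)  # mask non-edges
    attn = softmax(scores, axis=-1)
    return attn @ obj_feats                   # neighborhood-aggregated features

def question_guided_fusion(obj_feats, adjs, q_vec):
    """Fuse per-graph views with question-derived gates, then filter objects.

    obj_feats: (N, d) object features; adjs: list of (N, N) adjacency
    matrices (e.g. semantic / spatial / implicit); q_vec: (d,) question vector.
    """
    # One aggregated view per relation graph.
    views = [graph_attention(obj_feats, a) for a in adjs]
    # The question decides how much each graph contributes.
    gates = softmax(np.array([q_vec @ v.mean(axis=0) for v in views]))
    fused = sum(g * v for g, v in zip(gates, views))
    # Object filtering: keep objects with above-average question relevance
    # (a stand-in for the paper's learned filtering mechanism).
    rel = softmax(fused @ q_vec)
    keep = rel >= rel.mean()
    return fused[keep], keep
```

The gating step makes the fusion question-driven: a counting question and a spatial question would yield different gate weights over the same three graphs.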
Related papers
- A Comprehensive Survey on Visual Question Answering Datasets and Algorithms [1.941892373913038]
We meticulously analyze the current state of VQA datasets and models, while cleanly dividing them into distinct categories and then summarizing the methodologies and characteristics of each category.
We explore six main paradigms of VQA models: fusion, attention, the technique of using information from one modality to filter information from another, external knowledge base, composition or reasoning, and graph models.
arXiv Detail & Related papers (2024-11-17T18:52:06Z) - No-Reference Point Cloud Quality Assessment via Graph Convolutional Network [89.12589881881082]
Three-dimensional (3D) point cloud, as an emerging visual media format, is increasingly favored by consumers.
Point clouds inevitably suffer from quality degradation and information loss through multimedia communication systems.
We propose a novel no-reference PCQA method by using a graph convolutional network (GCN) to characterize the mutual dependencies of multi-view 2D projected image contents.
arXiv Detail & Related papers (2024-11-12T11:39:05Z) - Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference [107.53380946417003]
We propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference.
We develop a simple methodology to self-learn the visual hints without introducing any additional human annotations.
arXiv Detail & Related papers (2024-07-06T15:07:32Z) - Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering [16.502197578954917]
Graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features.
We propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA.
arXiv Detail & Related papers (2023-07-25T04:41:32Z) - Joint learning of object graph and relation graph for visual question answering [19.97265717398179]
We introduce a novel Dual Message-passing enhanced Graph Neural Network (DM-GNN)
DM-GNN can obtain a balanced representation by properly encoding multi-scale scene graph information.
We conduct extensive experiments on datasets including GQA, VG, and motif-VG, and achieve new state-of-the-art results.
arXiv Detail & Related papers (2022-05-09T11:08:43Z) - Question-Answer Sentence Graph for Joint Modeling Answer Selection [122.29142965960138]
We train and integrate state-of-the-art (SOTA) models for computing scores between question-question, question-answer, and answer-answer pairs.
Online inference is then performed to solve the AS2 task on unseen queries.
arXiv Detail & Related papers (2022-02-16T05:59:53Z) - Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for Visual Question Answering (VQA) task.
It not only builds a graph for the image but also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z) - Visual Relationship Forecasting in Videos [56.122037294234865]
We present a new task, Visual Relationship Forecasting (VRF) in videos, to explore the prediction of visual relationships through reasoning.
Given a subject-object pair with H existing frames, VRF aims to predict their future interactions for the next T frames without visual evidence.
To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series of temporally localized visual relation annotations in a video.
arXiv Detail & Related papers (2021-07-02T16:43:19Z) - GPS-Net: Graph Property Sensing Network for Scene Graph Generation [91.60326359082408]
Scene graph generation (SGG) aims to detect objects in an image along with their pairwise relationships.
GPS-Net fully explores three properties for SGG: edge direction information, the difference in priority between nodes, and the long-tailed distribution of relationships.
GPS-Net achieves state-of-the-art performance on three popular databases (VG, OI, and VRD) with significant gains under various settings and metrics.
arXiv Detail & Related papers (2020-03-29T07:22:31Z)