Learning Situation Hyper-Graphs for Video Question Answering
- URL: http://arxiv.org/abs/2304.08682v2
- Date: Sat, 6 May 2023 06:44:56 GMT
- Title: Learning Situation Hyper-Graphs for Video Question Answering
- Authors: Aisha Urooj Khan, Hilde Kuehne, Bo Wu, Kim Chheu, Walid Bousselham,
Chuang Gan, Niels Lobo, Mubarak Shah
- Abstract summary: We propose an architecture for Video Question Answering (VQA) that enables answering questions related to video content by predicting situation hyper-graphs.
We train a situation hyper-graph decoder to implicitly identify graph representations with actions and object/human-object relationships from the input video clip.
Our results show that learning the underlying situation hyper-graphs significantly improves the system's performance on challenging video question-answering tasks.
- Score: 95.18071873415556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Answering questions about complex situations in videos requires not only
capturing the presence of actors, objects, and their relations but also the
evolution of these relationships over time. A situation hyper-graph is a
representation that describes situations as scene sub-graphs for video frames
and hyper-edges for connected sub-graphs and has been proposed to capture all
such information in a compact structured form. In this work, we propose an
architecture for Video Question Answering (VQA) that enables answering
questions related to video content by predicting situation hyper-graphs, coined
Situation Hyper-Graph based Video Question Answering (SHG-VQA). To this end, we
train a situation hyper-graph decoder to implicitly identify graph
representations with actions and object/human-object relationships from the
input video clip, and to use cross-attention between the predicted situation
hyper-graphs and the question embedding to predict the correct answer. The
proposed method is trained in an end-to-end manner and optimized by a VQA loss
with the cross-entropy function and a Hungarian matching loss for the situation
graph prediction. The effectiveness of the proposed architecture is extensively
evaluated on two challenging benchmarks: AGQA and STAR. Our results show that
learning the underlying situation hyper-graphs significantly improves the
system's performance on these challenging video question-answering tasks.
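To make the described pipeline concrete, below is a minimal PyTorch sketch of how a set-prediction graph decoder, cross-attention fusion with the question, and a Hungarian matching loss could fit together. All module names, dimensions, and the DETR-style learned-query mechanism are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment


class SHGVQASketch(nn.Module):
    def __init__(self, d_model=256, num_queries=16, num_actions=157,
                 num_relations=50, num_answers=1000):
        super().__init__()
        # Learned queries decoded into situation hyper-graph tokens
        # (a DETR-style set-prediction decoder; an assumption here).
        self.graph_queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.graph_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, num_actions + 1)      # +1: "no action"
        self.relation_head = nn.Linear(d_model, num_relations + 1)  # +1: "no relation"
        # Cross-attention: question tokens attend to predicted graph tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8,
                                                batch_first=True)
        self.answer_head = nn.Linear(d_model, num_answers)

    def forward(self, video_feats, question_emb):
        # video_feats: (B, T, d_model) clip features; question_emb: (B, L, d_model)
        B = video_feats.size(0)
        queries = self.graph_queries.unsqueeze(0).expand(B, -1, -1)
        graph_tokens = self.graph_decoder(queries, video_feats)     # (B, Q, d_model)
        action_logits = self.action_head(graph_tokens)
        relation_logits = self.relation_head(graph_tokens)
        # Fuse the question with the predicted situation hyper-graph tokens.
        fused, _ = self.cross_attn(question_emb, graph_tokens, graph_tokens)
        answer_logits = self.answer_head(fused.mean(dim=1))         # (B, num_answers)
        return answer_logits, action_logits, relation_logits


def hungarian_matching_loss(pred_logits, target_labels):
    """Match Q predicted graph tokens to K ground-truth labels (K <= Q)
    with the Hungarian algorithm, then apply cross-entropy to the
    matched pairs. pred_logits: (Q, C); target_labels: (K,)."""
    probs = pred_logits.softmax(-1)
    cost = -probs[:, target_labels]                        # (Q, K) matching cost
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    row, col = torch.as_tensor(row), torch.as_tensor(col)
    return nn.functional.cross_entropy(pred_logits[row], target_labels[col])
```

The full training objective would then combine the answer cross-entropy with the matching losses for actions and relations, e.g. `loss = vqa_ce + lam * (action_match + relation_match)`, where `lam` is a tuning hyper-parameter (also an assumption, not a value from the paper).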
Related papers
- SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering [0.0]
Scene graphs have emerged as a useful tool for multimodal image analysis.
Current methods that utilize idealized annotated scene graphs struggle to generalize when using predicted scene graphs extracted from images.
Our approach extracts a scene graph from an input image using a pre-trained scene graph generator.
arXiv Detail & Related papers (2023-10-03T07:14:53Z)
- Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering [16.502197578954917]
Existing graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features.
We propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA.
arXiv Detail & Related papers (2023-07-25T04:41:32Z)
- ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos [120.80589215132322]
We present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over challenging untrimmed videos from ActivityNet.
ANetQA contains 1.4 billion unbalanced and 13.4 million balanced QA pairs, which is an order of magnitude larger than AGQA with a similar number of videos.
The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.
arXiv Detail & Related papers (2023-05-04T03:04:59Z)
- Question-Answer Sentence Graph for Joint Modeling Answer Selection [122.29142965960138]
We train and integrate state-of-the-art (SOTA) models for computing scores between question-question, question-answer, and answer-answer pairs.
Online inference is then performed to solve the answer sentence selection (AS2) task on unseen queries.
arXiv Detail & Related papers (2022-02-16T05:59:53Z)
- Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for Visual Question Answering (VQA) task.
First, it builds a graph for the image and also constructs a graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
- End-to-End Video Question-Answer Generation with Generator-Pretester Network [27.31969951281815]
We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia.
Since captions neither fully represent a video nor are always practically available, it is crucial to generate question-answer pairs from a video via VQAG.
We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance.
arXiv Detail & Related papers (2021-01-05T10:46:06Z)
- A Hierarchical Reasoning Graph Neural Network for The Automatic Scoring of Answer Transcriptions in Video Job Interviews [14.091472037847499]
We propose a Hierarchical Reasoning Graph Neural Network (HRGNN) for the automatic assessment of question-answer pairs.
We employ a semantic-level reasoning graph attention network to model the interaction states of the current QA session.
Finally, we propose a gated recurrent unit encoder to represent the temporal question-answer pairs for the final prediction.
arXiv Detail & Related papers (2020-12-22T12:27:45Z)
- Location-aware Graph Convolutional Networks for Video Question Answering [85.44666165818484]
We propose to represent the contents in the video as a location-aware graph.
Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action (a generic graph-convolution sketch follows this list).
Our method significantly outperforms state-of-the-art methods on TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
arXiv Detail & Related papers (2020-08-07T02:12:56Z)
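As a companion to the last entry above, here is a minimal, generic graph-convolution layer of the kind a location-aware video graph could be processed with. The adjacency construction and all names are assumptions for illustration, not that paper's exact formulation.

```python
import torch
import torch.nn as nn


class GraphConvLayer(nn.Module):
    """One generic GCN step, H' = ReLU(A_hat @ H @ W), over video
    region/frame nodes with a row-normalized adjacency A_hat."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # node_feats: (B, N, in_dim); adj: (B, N, N), e.g. built from region
        # locations and temporal order (an assumption for illustration).
        adj_hat = adj / adj.sum(-1, keepdim=True).clamp(min=1e-6)  # row-normalize
        return torch.relu(adj_hat @ self.proj(node_feats))
```

Stacking a few such layers lets each node aggregate information from spatially and temporally adjacent regions before a classification head predicts action categories and temporal locations.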