HySTER: A Hybrid Spatio-Temporal Event Reasoner
- URL: http://arxiv.org/abs/2101.06644v1
- Date: Sun, 17 Jan 2021 11:07:17 GMT
- Title: HySTER: A Hybrid Spatio-Temporal Event Reasoner
- Authors: Theophile Sautory, Nuri Cingillioglu, Alessandra Russo
- Abstract summary: We present the HySTER: a Hybrid Spatio-Temporal Event Reasoner to reason over physical events in videos.
We define a method based on general temporal, causal and physics rules which can be transferred across tasks.
This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.
- Score: 75.41988728376081
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of Video Question Answering (VideoQA) consists of answering natural language questions about a video and serves as a proxy for evaluating a model's performance in scene sequence understanding. Most methods designed for VideoQA to date are end-to-end deep learning architectures, which struggle with complex temporal and causal reasoning and offer limited transparency in their reasoning steps. We present the HySTER: a Hybrid Spatio-Temporal Event Reasoner to reason over physical events in videos. Our model combines the strength of deep learning methods for extracting information from video frames with the reasoning capabilities and explainability of symbolic artificial intelligence in an answer set programming framework. We define a method based on general temporal, causal and physics rules which can be transferred across tasks. We apply our model to the CLEVRER dataset and demonstrate state-of-the-art results in question answering accuracy. This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.
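The abstract describes a two-stage design: neural perception extracts objects and events from frames, and an answer set programming solver applies general temporal and causal rules over those facts to answer the question. The sketch below is a rough illustration of what that style of reasoning can look like, written against the clingo Python API; the facts, predicates, and rules are hypothetical stand-ins, not HySTER's actual rule base.

```python
# Minimal sketch: ASP reasoning over video events with clingo
# (pip install clingo). All predicates here are hypothetical.
import clingo

PROGRAM = """
% Facts standing in for a neural perception module's output:
% event(Type, Obj1, Obj2, Frame).
event(collision, cube1, sphere1, 42).
event(collision, sphere1, cylinder1, 57).

% General temporal rule: frame T1 precedes frame T2.
before(T1, T2) :- event(_, _, _, T1), event(_, _, _, T2), T1 < T2.

% Simple causal-chain rule: an earlier collision involving O2
% can explain a later collision of O2 with O3.
causes(O1, O3) :- event(collision, O1, O2, T1),
                  event(collision, O2, O3, T2),
                  before(T1, T2).

% Question: which object is causally responsible for hitting the cylinder?
answer(O) :- causes(O, cylinder1).
#show answer/1.
"""

ctl = clingo.Control()
ctl.add("base", [], PROGRAM)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print("Answer:", m.symbols(shown=True)))
```

Solving the toy program derives answer(cube1): the cube's collision with the sphere precedes, and here explains, the sphere's later collision with the cylinder.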
Related papers
- STAR: A Benchmark for Situated Reasoning in Real-World Videos [94.78038233351758]
This paper introduces a new benchmark that evaluates situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos.
The dataset includes four types of questions: interaction, sequence, prediction, and feasibility.
We propose a diagnostic neuro-symbolic model that can disentangle visual perception, situation abstraction, language understanding, and functional reasoning.
arXiv Detail & Related papers (2024-05-15T21:53:54Z)
- Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering [0.9712140341805068]
We propose a neural-symbolic framework called Neural-Symbolic VideoQA (NSVideo-QA) for real-world VideoQA tasks.
NSVideo-QA exhibits internal consistency in answering compositional questions and significantly improves the capability of logical inference for VideoQA tasks.
arXiv Detail & Related papers (2024-04-05T10:30:38Z)
- Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering [14.659023742381777]
Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to provide optimal answers.
We propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering.
Our proposed approach achieves superior performance compared to previous state-of-the-art models.
arXiv Detail & Related papers (2023-05-14T03:57:11Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
- Object-Centric Representation Learning for Video Question Answering [27.979053252431306]
Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors.
The task demands new capabilities to integrate video processing, language understanding, and the binding of abstract concepts to concrete visual artifacts.
We propose a new query-guided representation framework to turn a video into a relational graph of objects (a schematic sketch follows this entry).
arXiv Detail & Related papers (2021-04-12T02:37:20Z)
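As a rough illustration of the graph-of-objects idea above (the data structures and relation names are hypothetical, not the paper's query-guided framework), a video can be flattened into object nodes with pairwise spatio-temporal edges:

```python
# Hypothetical sketch: turning per-frame object detections into a
# relational graph; not the paper's actual framework.
from dataclasses import dataclass

@dataclass
class ObjectNode:
    obj_id: str
    frame: int
    box: tuple  # (x1, y1, x2, y2), assumed to come from a detector

def build_relation_graph(nodes):
    """Link detections that co-occur in a frame or persist across frames."""
    edges = []
    for a in nodes:
        for b in nodes:
            if a is b:
                continue
            if a.frame == b.frame:
                edges.append((a.obj_id, "co_occurs", b.obj_id, a.frame))
            elif a.obj_id == b.obj_id and b.frame == a.frame + 1:
                edges.append((a.obj_id, "persists", b.obj_id, b.frame))
    return edges

# Toy usage: two objects in frame 0, one of them tracked into frame 1.
nodes = [
    ObjectNode("cube", 0, (0, 0, 10, 10)),
    ObjectNode("ball", 0, (20, 0, 30, 10)),
    ObjectNode("cube", 1, (5, 0, 15, 10)),
]
for edge in build_relation_graph(nodes):
    print(edge)
```

A question-answering model would then reason over these edges rather than over raw frames, which is the motivation shared by the graph-based entries in this list.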
- Location-aware Graph Convolutional Networks for Video Question Answering [85.44666165818484]
We propose to represent the contents in the video as a location-aware graph.
Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action (a generic graph-convolution sketch follows this entry).
Our method significantly outperforms state-of-the-art methods on TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
arXiv Detail & Related papers (2020-08-07T02:12:56Z)
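The graph-convolution step mentioned in the entry above can be illustrated with a generic GCN layer in the style of Kipf and Welling; this is a minimal sketch assuming PyTorch, not the paper's location-aware architecture:

```python
# Generic graph-convolution layer (Kipf & Welling style); node features
# could be per-clip region descriptors, edges spatio-temporal adjacency.
import torch

class GraphConv(torch.nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # Add self-loops, then symmetrically normalize the adjacency so
        # aggregation averages over neighbours: D^-1/2 (A+I) D^-1/2.
        a = adj + torch.eye(adj.size(0))
        d = a.sum(dim=1).rsqrt().diag()
        return torch.relu(d @ a @ d @ self.linear(x))

# Toy usage: 5 region nodes with 16-d features and a random symmetric graph.
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.t()) > 0).float()
x = torch.randn(5, 16)
print(GraphConv(16, 8)(x, adj).shape)  # torch.Size([5, 8])
```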
- Hierarchical Conditional Relation Networks for Video Question Answering [62.1146543269993]
We introduce a general-purpose reusable neural unit called Conditional Relation Network (CRN).
CRN serves as a building block to construct more sophisticated structures for representation and reasoning over video.
Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
arXiv Detail & Related papers (2020-02-25T07:00:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.