HySTER: A Hybrid Spatio-Temporal Event Reasoner
- URL: http://arxiv.org/abs/2101.06644v1
- Date: Sun, 17 Jan 2021 11:07:17 GMT
- Title: HySTER: A Hybrid Spatio-Temporal Event Reasoner
- Authors: Theophile Sautory, Nuri Cingillioglu, Alessandra Russo
- Abstract summary: We present the HySTER: a Hybrid Spatio-Temporal Event Reasoner to reason over physical events in videos.
We define a method based on general temporal, causal and physics rules which can be transferred across tasks.
This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.
- Score: 75.41988728376081
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of Video Question Answering (VideoQA) consists in answering natural
language questions about a video and serves as a proxy to evaluate the
performance of a model in scene sequence understanding. Most methods designed
for VideoQA to date are end-to-end deep learning architectures which struggle
with complex temporal and causal reasoning and provide limited transparency in
their reasoning steps. We present HySTER: a Hybrid Spatio-Temporal Event
Reasoner to reason over physical events in videos. Our model combines the
strength of deep learning methods for extracting information from video frames
with the reasoning capabilities and explainability of symbolic artificial
intelligence in an answer set programming framework. We
define a method based on general temporal, causal and physics rules which can
be transferred across tasks. We apply our model to the CLEVRER dataset and
demonstrate state-of-the-art results in question answering accuracy. This work
sets the foundations for the incorporation of inductive logic programming in
the field of VideoQA.
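The abstract gives no implementation details; as an illustrative sketch only, the snippet below shows how hand-written temporal and causal rules over neurally extracted video events can be expressed in answer set programming and solved from Python with the clingo solver. The predicates (collision/3, before/2, causes/2, answer/3), the example facts, and the question encoding are hypothetical stand-ins, not the actual HySTER rule set.

```python
# Minimal sketch (not the authors' code): hand-written temporal/causal rules
# over hypothetical video events, solved with the clingo Python API.
from clingo import Control

ASP_PROGRAM = r"""
% Event facts that a neural perception module might extract from frames.
collision(cube, sphere, 12).
collision(sphere, cylinder, 20).

% Temporal rule: order collision events by their frame index.
before(T1, T2) :- collision(_, _, T1), collision(_, _, T2), T1 < T2.

% Simplified causal rule: a collision involving an object can cause a later
% collision in which that object takes part.
causes(collision(A, B, T1), collision(B, C, T2)) :-
    collision(A, B, T1), collision(B, C, T2), T1 < T2.

% Question encoding: which earlier collision is responsible for the
% sphere hitting the cylinder?
answer(A, B, T) :- causes(collision(A, B, T), collision(sphere, cylinder, _)).
#show answer/3.
"""

def main():
    ctl = Control()
    ctl.add("base", [], ASP_PROGRAM)   # load rules and event facts
    ctl.ground([("base", [])])         # ground the program
    # Print every derived answer/3 atom from the answer set.
    ctl.solve(on_model=lambda m: print(m.symbols(shown=True)))

if __name__ == "__main__":
    main()
```

Running the sketch prints [answer(cube,sphere,12)], the collision the causal rule identifies as responsible for the queried event; because the derivation is a readable chain of rules, the reasoning steps stay inspectable, which is the transparency argument made in the abstract.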
Related papers
- Admitting Ignorance Helps the Video Question Answering Models to Answer [82.22149677979189]
We argue that models often establish shortcuts, resulting in spurious correlations between questions and answers.
We propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question.
In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness.
arXiv Detail & Related papers (2025-01-15T12:44:52Z)
- STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks.
However, Video-LLMs struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions and events.
We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z)
- STAR: A Benchmark for Situated Reasoning in Real-World Videos [94.78038233351758]
This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos.
The dataset includes four types of questions, including interaction, sequence, prediction, and feasibility.
We propose a diagnostic neuro-symbolic model that can disentangle visual perception, situation abstraction, language understanding, and functional reasoning.
arXiv Detail & Related papers (2024-05-15T21:53:54Z)
- Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering [0.9712140341805068]
We propose a neural-symbolic framework called Neural-Symbolic VideoQA (NS-VideoQA) for real-world VideoQA tasks.
NS-VideoQA exhibits internal consistency in answering compositional questions and significantly improves the capability of logical inference for VideoQA tasks.
arXiv Detail & Related papers (2024-04-05T10:30:38Z)
- Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering [14.659023742381777]
Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to provide optimal answers.
We propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering.
Our proposed approach achieves superior performance compared to previous state-of-the-art models.
arXiv Detail & Related papers (2023-05-14T03:57:11Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
- Object-Centric Representation Learning for Video Question Answering [27.979053252431306]
Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors.
The task demands new capabilities to integrate video processing, language understanding, and the binding of abstract concepts to concrete visual artifacts.
We propose a new query-guided representation framework to turn a video into a relational graph of objects.
arXiv Detail & Related papers (2021-04-12T02:37:20Z)