Hierarchical Object-oriented Spatio-Temporal Reasoning for Video
Question Answering
- URL: http://arxiv.org/abs/2106.13432v1
- Date: Fri, 25 Jun 2021 05:12:42 GMT
- Title: Hierarchical Object-oriented Spatio-Temporal Reasoning for Video
Question Answering
- Authors: Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran
- Abstract summary: Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities.
We propose an object-oriented reasoning approach in which the video is abstracted as a dynamic stream of interacting objects.
This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture.
- Score: 27.979053252431306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Question Answering (Video QA) is a powerful testbed to develop new AI
capabilities. This task necessitates learning to reason about objects,
relations, and events across visual and linguistic domains in space-time.
High-level reasoning demands lifting from associative visual pattern
recognition to symbol-like manipulation over objects, their behavior and
interactions. Toward this goal, we propose an object-oriented reasoning
approach in which the video is abstracted as a dynamic stream of interacting
objects. At each stage of the video event flow, these objects interact with
each other, and their interactions are reasoned about with respect to the query
and under the overall context of a video. This mechanism is materialized into a
family of general-purpose neural units and their multi-level architecture
called Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR) networks.
This neural model maintains the objects' consistent lifelines in the form of a
hierarchically nested spatio-temporal graph. Within this graph, the dynamic
interactive object-oriented representations are built up along the video
sequence, hierarchically abstracted in a bottom-up manner, and converge toward
the key information for the correct answer. The method is evaluated on multiple
major Video QA datasets and establishes new state-of-the-art results on these
tasks. Analysis of the model's behavior indicates that object-oriented
reasoning is a reliable, interpretable, and efficient approach to Video QA.
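To make the mechanism concrete, below is a minimal PyTorch sketch in the spirit of the abstract: a single query-conditioned object-interaction unit reused at two levels of a bottom-up hierarchy. The module structure, the sigmoid gating, and the mean-pooling of object lifelines are illustrative assumptions, not the authors' released HOSTR implementation.
```python
import torch
import torch.nn as nn

class ObjectReasoningUnit(nn.Module):
    """One general-purpose unit: query-conditioned interaction among objects."""
    def __init__(self, d):
        super().__init__()
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(d, d) for _ in range(3))
        self.gate = nn.Linear(d, d)
        self.out = nn.Linear(d, d)

    def forward(self, objs, query):
        # objs: (B, N, d) object states; query: (B, d) question embedding.
        g = torch.sigmoid(self.gate(query)).unsqueeze(1)      # (B, 1, d)
        x = objs * g                                          # condition objects on the query
        attn = torch.softmax(
            self.q_proj(x) @ self.k_proj(x).transpose(1, 2)
            / x.size(-1) ** 0.5, dim=-1)                      # (B, N, N) pairwise interactions
        return objs + self.out(attn @ self.v_proj(x))         # residual update

class HierarchicalReasoner(nn.Module):
    """Stack units bottom-up: clip-level reasoning, then video-level reasoning
    over each object's time-pooled 'lifeline'."""
    def __init__(self, d):
        super().__init__()
        self.clip_unit = ObjectReasoningUnit(d)
        self.video_unit = ObjectReasoningUnit(d)

    def forward(self, objs, query):
        # objs: (B, T, N, d) per-frame features of N tracked objects.
        B, T, N, d = objs.shape
        clip = self.clip_unit(objs.reshape(B * T, N, d),
                              query.repeat_interleave(T, dim=0))
        lifelines = clip.reshape(B, T, N, d).mean(dim=1)      # pool each object over time
        return self.video_unit(lifelines, query).mean(dim=1)  # (B, d) answer-ready summary

model = HierarchicalReasoner(d=64)
objs = torch.randn(2, 8, 5, 64)        # 2 videos, 8 frames, 5 objects
query = torch.randn(2, 64)
print(model(objs, query).shape)        # torch.Size([2, 64])
```
The point of the hierarchy is that per-frame object interactions are summarized into per-object lifelines before video-level reasoning, which is one way to keep each object's identity consistent across time.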
Related papers
- Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives [30.666823939595627]
This paper reconsiders the multi-modal alignment problem in VideoQA from feature and sample perspectives.
We adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features.
Our method outperforms all the state-of-the-art models on the challenging NExT-QA benchmark.
arXiv Detail & Related papers (2022-04-25T10:42:07Z)
- Object-Region Video Transformers [100.23380634952083]
We present Object-Region Video Transformers (ORViT), an object-centric approach that extends transformer video layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show strong performance improvements across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
arXiv Detail & Related papers (2021-10-13T17:51:46Z)
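As a rough sketch of the two-stream idea in the entry above, the hypothetical block below fuses an appearance stream (patch tokens attending over object tokens) with a dynamics stream (an MLP over box coordinates). The real ORViT block is more elaborate, so treat all names, shapes, and the fusion scheme here as assumptions.
```python
import torch
import torch.nn as nn

class ORViTStyleBlock(nn.Module):
    """Two object-level streams fused back into the patch tokens: an
    appearance stream (attention over object + patch tokens) and a
    dynamics stream (an encoding of box trajectories)."""
    def __init__(self, d, n_heads=4):
        super().__init__()
        self.appearance = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.dynamics = nn.Sequential(nn.Linear(4, d), nn.ReLU(), nn.Linear(d, d))
        self.norm = nn.LayerNorm(d)

    def forward(self, patches, obj_tokens, boxes):
        # patches: (B, P, d) spatio-temporal patch tokens
        # obj_tokens: (B, O, d) pooled per-object appearance features
        # boxes: (B, O, 4) normalized box coordinates (x1, y1, x2, y2)
        motion = self.dynamics(boxes)                 # (B, O, d) trajectory encoding
        ctx = torch.cat([obj_tokens + motion, patches], dim=1)
        out, _ = self.appearance(patches, ctx, ctx)   # patches attend to objects + patches
        return self.norm(patches + out)

block = ORViTStyleBlock(d=96)
patches, obj_tok, boxes = torch.randn(2, 49, 96), torch.randn(2, 5, 96), torch.rand(2, 5, 4)
print(block(patches, obj_tok, boxes).shape)           # torch.Size([2, 49, 96])
```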
- Relation-aware Hierarchical Attention Framework for Video Question Answering [6.312182279855817]
We propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos.
In particular, videos and questions are first embedded by pre-trained models to obtain visual and textual features.
We consider the temporal, spatial, and semantic relations, and fuse the multimodal features with a hierarchical attention mechanism to predict the answer.
arXiv Detail & Related papers (2021-05-13T09:35:42Z)
- Object-Centric Representation Learning for Video Question Answering [27.979053252431306]
Video Question Answering (Video QA) presents a powerful testbed for human-like intelligent behaviors.
The task demands new capabilities to integrate video processing, language understanding, and the binding of abstract concepts to concrete visual artifacts.
We propose a new query-guided representation framework to turn a video into a relational graph of objects.
arXiv Detail & Related papers (2021-04-12T02:37:20Z)
- Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning [84.90458333884443]
We present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from video and language.
DCL can detect and associate objects across frames, ground visual properties and physical events, understand the causal relationships between events, make future and counterfactual predictions, and leverage these representations for answering queries.
DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulations for training.
arXiv Detail & Related papers (2021-03-30T17:59:48Z)
- HySTER: A Hybrid Spatio-Temporal Event Reasoner [75.41988728376081]
We present HySTER, a Hybrid Spatio-Temporal Event Reasoner, to reason over physical events in videos.
We define a method based on general temporal, causal and physics rules which can be transferred across tasks.
This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.
arXiv Detail & Related papers (2021-01-17T11:07:17Z)
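HySTER is rule-based rather than purely neural. As a toy flavor of transferable temporal and causal rules applied over events extracted from a video, here is a small Python sketch; the Event fields and both rules are invented for illustration and are not taken from the paper.
```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str        # e.g. "collision", "enter", "motion_change"
    objects: tuple   # ids of the objects involved
    frame: int

def before(e1, e2):
    """General temporal rule: e1 happens strictly before e2."""
    return e1.frame < e2.frame

def possible_cause(e1, e2, window=30):
    """Toy causal rule: a collision shortly before a motion change
    involving a shared object is a candidate cause."""
    shared = set(e1.objects) & set(e2.objects)
    return (e1.kind == "collision" and before(e1, e2)
            and e2.frame - e1.frame <= window and bool(shared))

events = [Event("collision", (1, 2), frame=40),
          Event("motion_change", (2,), frame=55)]
print(possible_cause(events[0], events[1]))   # True
```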
- Location-aware Graph Convolutional Networks for Video Question Answering [85.44666165818484]
We propose to represent the contents in the video as a location-aware graph.
Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action.
Our method significantly outperforms state-of-the-art methods on TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
arXiv Detail & Related papers (2020-08-07T02:12:56Z)
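A minimal sketch of one location-aware graph-convolution step for the entry above, assuming region features concatenated with box-and-frame location codes; the similarity-softmax adjacency used here is a simplification of the paper's graph construction.
```python
import torch
import torch.nn as nn

class LocationAwareGCNLayer(nn.Module):
    """One graph-convolution step over region nodes whose features are
    concatenated with their locations, so aggregation can depend on
    where and when a region occurs."""
    def __init__(self, d_feat, d_loc, d_out):
        super().__init__()
        self.proj = nn.Linear(d_feat + d_loc, d_out)

    def forward(self, feats, locs):
        # feats: (B, N, d_feat) region features; locs: (B, N, d_loc) encodes
        # each region's normalized box coordinates and frame index.
        x = torch.cat([feats, locs], dim=-1)                 # location-aware node states
        adj = torch.softmax(x @ x.transpose(1, 2), dim=-1)   # soft similarity adjacency
        return torch.relu(self.proj(adj @ x))                # aggregate, then transform

layer = LocationAwareGCNLayer(d_feat=64, d_loc=5, d_out=64)
feats = torch.randn(2, 10, 64)     # 10 region proposals per video
locs = torch.rand(2, 10, 5)        # (x1, y1, x2, y2, t)
print(layer(feats, locs).shape)    # torch.Size([2, 10, 64])
```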
- Dynamic Language Binding in Relational Visual Reasoning [67.85579756590478]
We present the Language-binding Object Graph Network (LOGNet), the first neural reasoning method with dynamic relational structures across both visual and textual domains.
Our method outperforms other methods in sophisticated question-answering tasks wherein multiple object relations are involved.
arXiv Detail & Related papers (2020-04-30T06:26:20Z)
- Hierarchical Conditional Relation Networks for Video Question Answering [62.1146543269993]
We introduce a general-purpose reusable neural unit called Conditional Relation Network (CRN).
CRN serves as a building block to construct more sophisticated structures for representation and reasoning over video.
Our evaluations on well-known datasets achieved new SoTA results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
arXiv Detail & Related papers (2020-02-25T07:00:48Z)
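The CRN unit is defined precisely in the paper; the sketch below is a deliberately simplified rendering of its core idea, relating k-subsets of an input array under a conditioning feature (e.g., the question) and aggregating the result. Mean pooling of subsets and the single fusion MLP are simplifications of the published design.
```python
import itertools
import torch
import torch.nn as nn

class CRN(nn.Module):
    """Simplified Conditional Relation Network: form k-subset relations over
    an array of inputs, condition each on a context feature, aggregate."""
    def __init__(self, d, k=2):
        super().__init__()
        self.k = k
        self.g = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())  # relation + condition

    def forward(self, inputs, cond):
        # inputs: (B, N, d) array of objects (e.g. clip features); cond: (B, d)
        B, N, d = inputs.shape
        rels = [inputs[:, list(idx), :].mean(dim=1)             # (B, d) per subset
                for idx in itertools.combinations(range(N), self.k)]
        rels = torch.stack(rels, dim=1)                         # (B, C(N,k), d)
        conditioned = self.g(torch.cat(
            [rels, cond.unsqueeze(1).expand_as(rels)], dim=-1))
        return conditioned.mean(dim=1)                          # (B, d) summary

crn = CRN(d=32, k=2)
clips = torch.randn(2, 4, 32)      # 4 clip-level features per video
question = torch.randn(2, 32)
print(crn(clips, question).shape)  # torch.Size([2, 32])
```
Because the unit's inputs and outputs have the same form, such blocks can be nested (clip level, then video level), which is how the paper builds hierarchical structures for VideoQA.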
This list is automatically generated from the titles and abstracts of the papers on this site.