Object-Centric Representation Learning for Video Question Answering
- URL: http://arxiv.org/abs/2104.05166v2
- Date: Tue, 13 Apr 2021 07:36:07 GMT
- Title: Object-Centric Representation Learning for Video Question Answering
- Authors: Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran
- Abstract summary: Video question answering (Video QA) presents a powerful testbed for human-like intelligent behaviors.
The task demands new capabilities to integrate video processing, language understanding, and the binding of abstract concepts to concrete visual artifacts.
We propose a new query-guided representation framework to turn a video into a relational graph of objects.
- Score: 27.979053252431306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video question answering (Video QA) presents a powerful testbed for
human-like intelligent behaviors. The task demands new capabilities to
integrate video processing, language understanding, binding abstract linguistic
concepts to concrete visual artifacts, and deliberative reasoning over
spacetime. Neural networks offer a promising approach to reach this potential
through learning from examples rather than handcrafting features and rules.
However, neural networks are predominantly feature-based - they map data to
unstructured vectorial representation and thus can fall into the trap of
exploiting shortcuts through surface statistics instead of true systematic
reasoning seen in symbolic systems. To tackle this issue, we advocate for
object-centric representation as a basis for constructing spatio-temporal
structures from videos, essentially bridging the semantic gap between low-level
pattern recognition and high-level symbolic algebra. To this end, we propose a
new query-guided representation framework to turn a video into an evolving
relational graph of objects, whose features and interactions are dynamically
and conditionally inferred. The objects' lives are then summarized into resumes,
lending themselves naturally to deliberative relational reasoning that produces an answer
to the query. The framework is evaluated on major Video QA datasets,
demonstrating clear benefits of the object-centric approach to video reasoning.
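To make the abstract's pipeline concrete, below is a minimal PyTorch sketch of one query-conditioned message-passing step over per-frame object features, followed by a temporal pooling that plays the role of the object "resumes". This is an illustrative reading of the abstract only, not the authors' implementation; the module name QueryGuidedObjectGraph and all dimensions are assumptions.

```python
# Minimal sketch of a query-conditioned object graph step (illustrative only;
# all names and dimensions here are hypothetical, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryGuidedObjectGraph(nn.Module):
    """One message-passing step over per-frame object features,
    with edge weights conditioned on the query embedding."""
    def __init__(self, d_obj: int, d_query: int):
        super().__init__()
        self.edge_score = nn.Linear(2 * d_obj + d_query, 1)  # scores object pairs given the query
        self.update = nn.Linear(2 * d_obj, d_obj)            # fuses a node with its aggregated messages

    def forward(self, objects: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # objects: (T, N, d_obj) object features for T frames, N objects per frame
        # query:   (d_query,) pooled question embedding
        T, N, d = objects.shape
        rcv = objects.unsqueeze(2).expand(T, N, N, d)        # receiver features
        snd = objects.unsqueeze(1).expand(T, N, N, d)        # sender features
        q = query.view(1, 1, 1, -1).expand(T, N, N, -1)
        logits = self.edge_score(torch.cat([rcv, snd, q], dim=-1)).squeeze(-1)
        adj = F.softmax(logits, dim=-1)                      # query-dependent relation weights
        messages = torch.einsum("tij,tjd->tid", adj, objects)
        return torch.relu(self.update(torch.cat([objects, messages], dim=-1)))

# Object "resumes": summarize each object's trajectory over time, then reason over the summaries.
graph = QueryGuidedObjectGraph(d_obj=256, d_query=512)
objs = torch.randn(16, 5, 256)       # 16 frames, 5 detected objects
q = torch.randn(512)
refined = graph(objs, q)             # (16, 5, 256)
resumes = refined.mean(dim=0)        # (5, 256): one summary vector per object's "life"
```

Conditioning the edge scores on the query is the key idea: the graph's relational structure is inferred dynamically per question rather than fixed in advance.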
Related papers
- Object-Centric Temporal Consistency via Conditional Autoregressive Inductive Biases [69.46487306858789]
Conditional Autoregressive Slot Attention (CA-SA) is a framework that enhances the temporal consistency of extracted object-centric representations in video-centric vision tasks.
We present qualitative and quantitative results showing that our proposed method outperforms the considered baselines on downstream tasks.
arXiv Detail & Related papers (2024-10-21T07:44:44Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering [27.979053252431306]
Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities.
We propose an object-oriented reasoning approach in which video is abstracted as a dynamic stream of interacting objects.
This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture.
arXiv Detail & Related papers (2021-06-25T05:12:42Z)
- Relation-aware Hierarchical Attention Framework for Video Question Answering [6.312182279855817]
We propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos.
In particular, videos and questions are first embedded by pre-trained models to obtain visual and textual features.
We consider temporal, spatial, and semantic relations, and fuse the multimodal features with a hierarchical attention mechanism to predict the answer.
arXiv Detail & Related papers (2021-05-13T09:35:42Z)
- HySTER: A Hybrid Spatio-Temporal Event Reasoner [75.41988728376081]
We present HySTER, a Hybrid Spatio-Temporal Event Reasoner that reasons over physical events in videos.
We define a method based on general temporal, causal and physics rules which can be transferred across tasks.
This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.
arXiv Detail & Related papers (2021-01-17T11:07:17Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Dynamic Language Binding in Relational Visual Reasoning [67.85579756590478]
We present the Language-binding Object Graph Network, the first neural reasoning method with dynamic relational structures across both visual and textual domains.
Our method outperforms other methods on sophisticated question-answering tasks in which multiple object relations are involved.
arXiv Detail & Related papers (2020-04-30T06:26:20Z)
- Hierarchical Conditional Relation Networks for Video Question Answering [62.1146543269993]
We introduce a general-purpose reusable neural unit called the Conditional Relation Network (CRN).
CRN serves as a building block to construct more sophisticated structures for representation and reasoning over video; a rough sketch of such a unit follows this list.
Our evaluations on well-known datasets achieved new state-of-the-art results, demonstrating the impact of building a general-purpose reasoning unit on complex domains such as VideoQA.
arXiv Detail & Related papers (2020-02-25T07:00:48Z)
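The CRN entry above describes a reusable conditional relation unit; the sketch below shows one plausible shape for such a unit, following only the one-line description (relations over subsets of an input array, conditioned on a context feature). The class name ConditionalRelationUnit, the subset size k, and the MLP are assumptions, not the published design.

```python
# Rough sketch of a CRN-style conditional relation unit (assumptions only,
# not the published architecture).
import itertools
import torch
import torch.nn as nn

class ConditionalRelationUnit(nn.Module):
    """Aggregates k-subsets of an input array and conditions each on a context vector."""
    def __init__(self, d: int, k: int = 2):
        super().__init__()
        self.k = k
        self.g = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())  # relation + conditioning MLP

    def forward(self, inputs: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # inputs: (n, d) array of n feature vectors; condition: (d,) e.g. a question embedding
        outputs = []
        for subset in itertools.combinations(range(inputs.size(0)), self.k):
            joint = inputs[list(subset)].mean(dim=0)                # summarize the subset
            outputs.append(self.g(torch.cat([joint, condition])))   # condition on the context
        return torch.stack(outputs)  # output is again an array of feature vectors

unit = ConditionalRelationUnit(d=128)
clips = torch.randn(4, 128)          # e.g. 4 clip-level features
question = torch.randn(128)
out = unit(clips, question)          # (6, 128): one vector per 2-subset of the 4 clips
```

Because the unit maps an array of features to another array, copies of it can be stacked hierarchically, e.g. frames into clips and clips into the whole video.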
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.