DramaQA: Character-Centered Video Story Understanding with Hierarchical QA
- URL: http://arxiv.org/abs/2005.03356v2
- Date: Thu, 17 Dec 2020 02:59:37 GMT
- Title: DramaQA: Character-Centered Video Story Understanding with Hierarchical QA
- Authors: Seongho Choi, Kyoung-Woon On, Yu-Jung Heo, Ahjeong Seo, Youwon Jang,
Minsu Lee, Byoung-Tak Zhang
- Abstract summary: We propose a novel video question answering (Video QA) task, DramaQA, for a comprehensive understanding of the video story.
Our dataset is built upon the TV drama "Another Miss Oh" and contains 17,983 QA pairs from 23,928 video clips of various lengths.
We provide 217,308 annotated images with rich character-centered annotations, including visual bounding boxes, behaviors and emotions of main characters.
- Score: 24.910132013543947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent progress in computer vision and natural language processing,
building a machine that can understand video stories remains difficult due to their
intrinsic complexity. Moreover, research on how to evaluate the degree of video
understanding based on human cognitive processes has not yet progressed. In this paper,
we propose a novel video question answering (Video QA) task, DramaQA, for comprehensive
understanding of video stories. DramaQA focuses on two perspectives: 1) hierarchical QAs
as an evaluation metric based on the cognitive developmental stages of human
intelligence, and 2) character-centered video annotations to model the local coherence
of the story. Our dataset is built upon the TV drama "Another Miss Oh" and contains
17,983 QA pairs from 23,928 video clips of various lengths, with each QA pair belonging
to one of four difficulty levels. We provide 217,308 annotated images with rich
character-centered annotations, including visual bounding boxes, behaviors, and emotions
of main characters, as well as coreference-resolved scripts. Additionally, we propose a
Multi-level Context Matching model that hierarchically understands character-centered
representations of video to answer questions. We release our dataset and model publicly
for research purposes, and we expect our work to provide a new perspective on video
story understanding research.
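To make the dataset description concrete, the sketch below shows one way a character-centered QA record of this kind could be represented in code. It is a minimal, hypothetical illustration: the class and field names (QAItem, clip_id, difficulty, and so on) and the sample values are assumptions for exposition, not the schema or contents of the released dataset.

```python
# Minimal, hypothetical sketch of a character-centered QA record,
# using illustrative field names (not the released DramaQA schema).
from dataclasses import dataclass, field
from typing import List


@dataclass
class CharacterAnnotation:
    name: str            # main-character identity label
    bbox: List[float]    # visual bounding box as [x1, y1, x2, y2]
    behavior: str        # annotated behavior of the character
    emotion: str         # annotated emotion of the character


@dataclass
class QAItem:
    clip_id: str
    question: str
    answers: List[str]   # multiple-choice answer candidates
    correct_idx: int     # index of the correct answer
    difficulty: int      # one of the four hierarchical difficulty levels
    script: str          # coreference-resolved dialogue for the clip
    annotations: List[CharacterAnnotation] = field(default_factory=list)


# Toy usage: bucket items by difficulty level, mirroring the hierarchical
# evaluation described in the abstract.
items = [
    QAItem(
        clip_id="clip_0001",
        question="Why did Haeyoung leave the restaurant?",
        answers=["She was angry", "She got a call", "She was late", "She was bored"],
        correct_idx=1,
        difficulty=3,
        script="Haeyoung: I have to go now. Dokyung: Already?",
        annotations=[
            CharacterAnnotation("Haeyoung1", [12.0, 30.0, 140.0, 260.0], "stand up", "surprise")
        ],
    )
]

by_level = {}
for item in items:
    by_level.setdefault(item.difficulty, []).append(item)
print({level: len(group) for level, group in by_level.items()})
```

Grouping records by level in this way would correspond to reporting accuracy separately per difficulty level, which is how a hierarchical QA criterion of this kind can serve as an evaluation metric.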
Related papers
- FunQA: Towards Surprising Video Comprehension [64.58663825184958]
We introduce FunQA, a challenging video question-answering dataset.
FunQA covers three previously unexplored types of surprising videos: HumorQA, CreativeQA, and MagicQA.
In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips.
arXiv Detail & Related papers (2023-06-26T17:59:55Z)
- EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show significant gaps between these models and humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
- Structured Two-stream Attention Network for Video Question Answering [168.95603875458113]
We propose a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question.
First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features.
Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text.
arXiv Detail & Related papers (2022-06-02T12:25:52Z)
- Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
This paper provides a clear taxonomy and comprehensive analysis of VideoQA, focusing on the datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented as a frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- Co-attentional Transformers for Story-Based Video Understanding [24.211255523490692]
We propose a novel co-attentional transformer model to better capture long-term dependencies seen in visual stories such as dramas.
We evaluate our approach on the recently introduced DramaQA dataset which features character-centered video story understanding questions.
arXiv Detail & Related papers (2020-10-27T07:17:09Z)
- Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions [27.63022376316052]
We design ROLL, a model for knowledge-based video story question answering.
ROLL consists of three tasks, each in charge of extracting rich and diverse information: 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion.
To answer a given question correctly, the information generated by each cognitively inspired task is encoded via Transformers and fused through a modality weighting mechanism.
arXiv Detail & Related papers (2020-07-17T04:26:38Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans Do [3.423039905282442]
We propose a new evaluation challenge and direction in the area of High-level Video Understanding.
The proposed challenge is designed to test automatic video analysis and understanding, and how accurately systems can comprehend a movie in terms of actors, entities, events, and their relationships to each other.
A pilot High-Level Video Understanding dataset of open-source movies was collected for human assessors to build a knowledge graph representing each of them.
A set of queries will be derived from the knowledge graph to test systems on retrieving relationships among actors, as well as reasoning and retrieving non-visual concepts.
arXiv Detail & Related papers (2020-05-01T15:58:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.