HLVU : A New Challenge to Test Deep Understanding of Movies the Way Humans do
- URL: http://arxiv.org/abs/2005.00463v1
- Date: Fri, 1 May 2020 15:58:13 GMT
- Title: HLVU : A New Challenge to Test Deep Understanding of Movies the Way Humans do
- Authors: Keith Curtis, George Awad, Shahzad Rajput, and Ian Soboroff
- Abstract summary: We propose a new evaluation challenge and direction in the area of High-level Video Understanding.
The challenge we are proposing is designed to test automatic video analysis and understanding, and how accurately systems can comprehend a movie in terms of actors, entities, events and their relationship to each other.
A pilot High-Level Video Understanding dataset of open-source movies was collected for human assessors to build a knowledge graph representing each of them.
A set of queries will be derived from the knowledge graph to test systems on retrieving relationships among actors, as well as reasoning and retrieving non-visual concepts.
- Score: 3.423039905282442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we propose a new evaluation challenge and direction in the area
of High-level Video Understanding. The challenge we are proposing is designed
to test automatic video analysis and understanding, and how accurately systems
can comprehend a movie in terms of actors, entities, events and their
relationship to each other. A pilot High-Level Video Understanding (HLVU)
dataset of open-source movies was collected for human assessors to build a
knowledge graph representing each of them. A set of queries will be derived
from the knowledge graph to test systems on retrieving relationships among
actors, as well as reasoning and retrieving non-visual concepts. The objective
is to benchmark if a computer system can "understand" non-explicit but obvious
relationships the same way humans do when they watch the same movies. This is
a long-standing problem that is being addressed in the text domain, and this
project moves similar research to the video domain. Work of this nature is
foundational to future video analytics and video understanding technologies.
This work can be of interest to streaming services and broadcasters hoping to
provide more intuitive ways for their customers to interact with and consume
video content.
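
To make the evaluation setup concrete, below is a minimal sketch in Python of the general idea: a movie knowledge graph stored as labeled triples and a relationship query answered over it. The node names, relation labels, and helper function are hypothetical illustrations, not the actual HLVU annotation format or query schema.

```python
# Minimal sketch (assumptions, not the actual HLVU annotation format):
# a movie's knowledge graph stored as (subject, relation, object) triples,
# plus a toy query that retrieves relationships between two entities,
# mirroring the kind of relationship queries derived from the HLVU graphs.
from collections import defaultdict

# Hypothetical triples; real HLVU graphs are built by human assessors
# from open-source movies and cover actors, entities, and events.
triples = [
    ("Alice", "married_to", "Bob"),
    ("Bob", "works_for", "Acme Corp"),
    ("Alice", "distrusts", "Carol"),
    ("Carol", "witnesses", "robbery"),
]

graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

def relationships(subject: str, obj: str) -> list[str]:
    """Return the relation labels linking `subject` to `obj`, if any."""
    return [rel for rel, o in graph[subject] if o == obj]

# Example query derived from the graph:
# "What is the relationship between Alice and Bob?"
print(relationships("Alice", "Bob"))  # ['married_to']
```

A system under evaluation would be given the movie itself rather than the graph, and its answers to such queries would be scored against the assessor-built graph.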
Related papers
- A Survey of Video Datasets for Grounded Event Understanding [34.11140286628736]
Multimodal AI systems must be capable of well-rounded common-sense reasoning akin to human visual understanding.
We survey 105 video datasets that require event understanding capability.
arXiv Detail & Related papers (2024-06-14T00:36:55Z)
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- VideoChat: Chat-Centric Video Understanding [80.63932941216129]
We develop an end-to-end chat-centric video understanding system, coined as VideoChat.
It integrates video foundation models and large language models via a learnable neural interface.
Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications.
arXiv Detail & Related papers (2023-05-10T17:59:04Z)
- Contextual Explainable Video Representation: Human Perception-based Understanding [10.172332586182792]
We discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment.
We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception based-contextual representation in video understanding.
arXiv Detail & Related papers (2022-12-12T19:29:07Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- EgoTaskQA: Understanding Human Tasks in Egocentric Videos [89.9573084127155]
The EgoTaskQA benchmark provides a home for crucial dimensions of task understanding through question answering on real-world egocentric videos.
We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others.
We evaluate state-of-the-art video reasoning models on our benchmark and show their significant gaps from humans in understanding complex goal-oriented egocentric videos.
arXiv Detail & Related papers (2022-10-08T05:49:05Z)
- Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
This paper provides a clear taxonomy and comprehensive analyses of VideoQA, focusing on the datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z)
- TrUMAn: Trope Understanding in Movies and Animations [19.80173687261055]
We present a Trope Understanding and Storytelling (TrUSt) dataset with a new Conceptual module.
TrUSt guides the video encoder by performing video storytelling on a latent space.
Experimental results demonstrate that state-of-the-art learning systems on existing tasks reach only 12.01% accuracy with raw input signals.
arXiv Detail & Related papers (2021-08-10T09:34:14Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning-based approaches have been dedicated to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- DramaQA: Character-Centered Video Story Understanding with Hierarchical QA [24.910132013543947]
We propose a novel video question answering (Video QA) task, DramaQA, for a comprehensive understanding of the video story.
Our dataset is built upon the TV drama "Another Miss Oh" and it contains 17,983 QA pairs from 23,928 various length video clips.
We provide 217,308 annotated images with rich character-centered annotations, including visual bounding boxes, behaviors and emotions of main characters.
arXiv Detail & Related papers (2020-05-07T09:44:58Z)