TrUMAn: Trope Understanding in Movies and Animations
- URL: http://arxiv.org/abs/2108.04542v2
- Date: Wed, 11 Aug 2021 12:22:47 GMT
- Title: TrUMAn: Trope Understanding in Movies and Animations
- Authors: Hung-Ting Su, Po-Wei Shen, Bing-Chen Tsai, Wen-Feng Cheng, Ke-Jyun
Wang, Winston H. Hsu
- Abstract summary: We present Trope Understanding and Storytelling (TrUSt), a model with a new Conceptual Storyteller module.
TrUSt guides the video encoder by performing video storytelling on a latent space.
Experimental results demonstrate that state-of-the-art learning systems on existing tasks reach only 12.01% accuracy with raw input signals.
- Score: 19.80173687261055
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding and comprehending video content is crucial for many real-world
applications such as search and recommendation systems. While recent progress
of deep learning has boosted performance on various tasks using visual cues,
deep cognition to reason intentions, motivation, or causality remains
challenging. Existing datasets that aim to examine video reasoning capability
focus on visual signals such as actions, objects, and relations, or can be
answered by exploiting text bias. Observing this, we propose a novel task, along
with a new dataset: Trope Understanding in Movies and Animations (TrUMAn),
intending to evaluate and develop learning systems beyond visual signals.
Tropes are storytelling devices frequently used in creative works. By tackling
the trope understanding task and enabling the deep cognition skills of
machines, we are optimistic that data mining applications and algorithms could
be taken to the next level. To tackle the challenging TrUMAn dataset, we
present Trope Understanding and Storytelling (TrUSt), a model with a new
Conceptual Storyteller module, which guides the video encoder by performing video
storytelling on a latent space. The generated story embedding is then fed into
the trope understanding model to provide further signals. Experimental results
demonstrate that state-of-the-art learning systems on existing tasks reach only
12.01% accuracy with raw input signals. Also, even in the oracle case with
human-annotated descriptions, BERT contextual embeddings achieve at most 28%
accuracy. Our proposed TrUSt boosts model performance, reaching 13.94%
accuracy. We also provide a detailed analysis to pave the way for future
research. TrUMAn is publicly available
at: https://www.cmlab.csie.ntu.edu.tw/project/trope
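The abstract describes the TrUSt pipeline only at a high level: a video encoder, a Conceptual Storyteller that performs video storytelling in a latent space, and a trope classifier that consumes the resulting story embedding. Below is a minimal sketch of how such a pipeline could be wired together, assuming precomputed frame features; the module choices, dimensions, and trope count are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a TrUSt-style pipeline: video encoder -> Conceptual
# Storyteller (latent video storytelling) -> trope classifier. All module
# choices, dimensions, and the class count are assumptions, not the paper's.
import torch
import torch.nn as nn

class ConceptualStoryteller(nn.Module):
    """Maps encoded video features to a latent 'story' embedding (assumed GRU)."""
    def __init__(self, feat_dim=512, story_dim=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, story_dim, batch_first=True)

    def forward(self, video_feats):            # (B, T, feat_dim)
        _, h = self.rnn(video_feats)            # h: (1, B, story_dim)
        return h.squeeze(0)                     # story embedding: (B, story_dim)

class TrUStSketch(nn.Module):
    def __init__(self, num_tropes, raw_dim=1024, feat_dim=512, story_dim=256):
        super().__init__()
        self.video_encoder = nn.Linear(raw_dim, feat_dim)   # stand-in for a real video backbone
        self.storyteller = ConceptualStoryteller(feat_dim, story_dim)
        self.classifier = nn.Linear(feat_dim + story_dim, num_tropes)

    def forward(self, raw_feats):               # (B, T, raw_dim) precomputed frame features
        v = self.video_encoder(raw_feats)       # (B, T, feat_dim)
        story = self.storyteller(v)             # story embedding guides the prediction
        pooled = v.mean(dim=1)                  # simple temporal average pooling
        return self.classifier(torch.cat([pooled, story], dim=-1))  # trope logits

# Example forward pass with random features (placeholder trope count).
model = TrUStSketch(num_tropes=100)
logits = model(torch.randn(2, 16, 1024))        # 2 clips, 16 frames each
```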
Related papers
- Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies [69.28082193942991]
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills.
Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches.
To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR).
arXiv Detail & Related papers (2024-06-16T12:58:31Z) - A Survey of Video Datasets for Grounded Event Understanding [34.11140286628736]
Multimodal AI systems must be capable of well-rounded common-sense reasoning akin to human visual understanding.
We survey 105 video datasets that require event understanding capability.
arXiv Detail & Related papers (2024-06-14T00:36:55Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
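The blurb above defines dense video object captioning only abstractly, as a task that unifies spatial and temporal localization with captioning. A toy data structure for one possible output record is sketched below; the field names and schema are illustrative assumptions, not the paper's format.

```python
# Hypothetical output record for dense video object captioning: each object
# gets a caption plus per-frame boxes (field names are illustrative only).
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class ObjectTrack:
    object_id: int
    caption: str                                           # description of this object
    boxes: Dict[int, Tuple[float, float, float, float]]    # frame index -> (x1, y1, x2, y2)

    @property
    def temporal_span(self) -> Tuple[int, int]:
        frames = sorted(self.boxes)
        return frames[0], frames[-1]                        # first/last frame the object appears

track = ObjectTrack(
    object_id=3,
    caption="a dog chasing a red ball across the lawn",
    boxes={0: (12.0, 40.0, 88.0, 120.0), 1: (15.0, 42.0, 90.0, 122.0)},
)
print(track.temporal_span)                                  # (0, 1)
```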
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for a crucial task in computational social science: persuasion strategy identification.
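The recipe summarized here is a two-stage pipeline: verbalize the video into a textual story, then solve the downstream task on that text alone. The sketch below assumes generic `caption_frame` and `answer_question` callables as stand-ins for any off-the-shelf captioner and text model; it is not the authors' implementation.

```python
# Minimal verbalize-then-reason pipeline; caption_frame and answer_question are
# assumed stand-ins for any off-the-shelf captioner and text model.
from typing import Callable, List

def verbalize_video(frames: List, caption_frame: Callable[[object], str]) -> str:
    """Stage 1: turn sampled frames into a textual 'story' by captioning each one."""
    return " ".join(caption_frame(f) for f in frames)

def zero_shot_video_task(frames: List, question: str,
                         caption_frame: Callable[[object], str],
                         answer_question: Callable[[str, str], str]) -> str:
    """Stage 2: answer the question from the generated story, never the raw video."""
    story = verbalize_video(frames, caption_frame)
    return answer_question(story, question)

# Usage with dummy components:
dummy_captioner = lambda frame: f"the frame shows {frame}"
dummy_reader = lambda story, q: f"(answer derived from: {story[:40]}...)"
print(zero_shot_video_task(["a crowd", "a speaker on stage"],
                           "What persuasion strategy is used?",
                           dummy_captioner, dummy_reader))
```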
arXiv Detail & Related papers (2023-05-16T19:13:11Z) - How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
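The two annotation types map naturally onto two different training objectives: a distribution-matching loss for the fine-grained emotional responses and a pairwise ranking loss for relative pleasantness. The formulation below is an assumed illustration, not the losses used in the paper.

```python
# Assumed objectives for the two annotation types: distribution matching for
# fine-grained emotions, pairwise ranking for relative pleasantness.
import torch
import torch.nn.functional as F

def emotion_distribution_loss(pred_logits, target_dist):
    """KL divergence between predicted and annotated emotion distributions."""
    log_pred = F.log_softmax(pred_logits, dim=-1)
    return F.kl_div(log_pred, target_dist, reduction="batchmean")

def relative_pleasantness_loss(score_a, score_b, a_preferred, margin=0.1):
    """Margin ranking loss: the video annotated as more pleasant should score higher."""
    sign = a_preferred.float() * 2.0 - 1.0      # True -> +1, False -> -1
    return F.margin_ranking_loss(score_a, score_b, sign, margin=margin)
```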
arXiv Detail & Related papers (2022-10-18T17:58:25Z) - VidLanKD: Improving Language Understanding via Video-Distilled Knowledge
Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
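The teacher-to-student transfer described for VidLanKD is a knowledge-distillation setup; the summary does not spell out the objectives, so the sketch below shows a generic soft-label distillation loss as an assumed stand-in.

```python
# Generic soft-label distillation loss; VidLanKD's actual objectives are not
# specified in the summary above, so this is an assumed stand-in.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```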
arXiv Detail & Related papers (2021-07-06T15:41:32Z) - Situation and Behavior Understanding by Trope Detection on Films [26.40954537814751]
We present a novel task, trope detection on films, in an effort to create a situation and behavior understanding for machines.
We introduce a new dataset, Tropes in Movie Synopses (TiMoS), with 5623 movie synopses and 95 different tropes collected from a Wikipedia-style database, TVTropes.
We present a multi-stream comprehension network (MulCom) leveraging multi-level attention of words, sentences, and role relations.
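The summary mentions multi-level attention over words, sentences, and role relations without giving the formulation. A minimal attention-pooling block that could, under that assumption, be applied at each level is sketched below; it is not MulCom's actual architecture.

```python
# Minimal attention pooling that could be applied at each level (words,
# sentences, role relations); illustrative only, not MulCom's architecture.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x, mask=None):            # x: (B, N, dim)
        scores = self.score(x).squeeze(-1)       # (B, N) unnormalized attention scores
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        return (weights.unsqueeze(-1) * x).sum(dim=1)   # (B, dim) pooled representation

# Stacking idea: pool word vectors into sentence vectors, pool sentence vectors
# into a synopsis vector, and fuse with a role-relation stream downstream.
```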
arXiv Detail & Related papers (2021-01-19T14:09:54Z) - Co-attentional Transformers for Story-Based Video Understanding [24.211255523490692]
We propose a novel co-attentional transformer model to better capture long-term dependencies seen in visual stories such as dramas.
We evaluate our approach on the recently introduced DramaQA dataset which features character-centered video story understanding questions.
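Co-attention between two streams (for example, video frames and dialogue) is commonly realized by letting each stream attend to the other with standard multi-head attention; the block below is a generic sketch under that assumption, not the paper's exact model.

```python
# Generic cross-stream co-attention block (an assumed formulation, not the
# paper's exact model): each stream queries the other.
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.vid_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_vid = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video, text):              # (B, Tv, dim), (B, Tt, dim)
        video_ctx, _ = self.vid_to_txt(query=video, key=text, value=text)
        text_ctx, _ = self.txt_to_vid(query=text, key=video, value=video)
        return video + video_ctx, text + text_ctx  # residual fusion of both streams
```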
arXiv Detail & Related papers (2020-10-27T07:17:09Z) - PointContrast: Unsupervised Pre-training for 3D Point Cloud
Understanding [107.02479689909164]
In this work, we aim at facilitating research on 3D representation learning.
We measure the effect of unsupervised pre-training on a large source set of 3D scenes.
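The summary states that pre-training is unsupervised but not what the objective is; PointContrast is a contrastive method, so an InfoNCE-style loss over matched point features from two views is sketched below as an assumed illustration of the general recipe.

```python
# Generic InfoNCE loss over matched point features from two views of the same
# scene; an illustrative sketch of contrastive pre-training, not the exact loss.
import torch
import torch.nn.functional as F

def info_nce(feats_view1, feats_view2, temperature=0.07):
    """feats_view1[i] and feats_view2[i] describe the same 3D point."""
    z1 = F.normalize(feats_view1, dim=-1)        # (N, D)
    z2 = F.normalize(feats_view2, dim=-1)        # (N, D)
    logits = z1 @ z2.t() / temperature           # pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)      # matched pairs are the positives
```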
arXiv Detail & Related papers (2020-07-21T17:59:22Z) - HLVU : A New Challenge to Test Deep Understanding of Movies the Way
Humans do [3.423039905282442]
We propose a new evaluation challenge and direction in the area of High-level Video Understanding.
The challenge we are proposing is designed to test automatic video analysis and understanding, and how accurately systems can comprehend a movie in terms of actors, entities, events and their relationship to each other.
A pilot High-Level Video Understanding dataset of open-source movies was collected for human assessors to build a knowledge graph representing each of them.
A set of queries will be derived from the knowledge graph to test systems on retrieving relationships among actors, as well as reasoning and retrieving non-visual concepts.
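The evaluation centers on a knowledge graph of actors, entities, events, and their relationships, with queries derived from it. The toy sketch below shows the kind of triple store and relationship query involved; all entity and relation names are made up and do not reflect the HLVU data format.

```python
# Toy movie knowledge graph as (subject, relation, object) triples, with a
# relationship query; all names are made up, not the HLVU data format.
from collections import defaultdict

edges = [
    ("Alice", "sibling_of", "Bob"),
    ("Alice", "participates_in", "heist"),
    ("Bob", "plans", "heist"),
]

# Index triples by (subject, object) so relationship queries are direct lookups.
index = defaultdict(list)
for subj, rel, obj in edges:
    index[(subj, obj)].append(rel)

def relations_between(subject, obj):
    """Return every relation linking subject to obj (empty list if none)."""
    return index[(subject, obj)]

print(relations_between("Alice", "Bob"))          # ['sibling_of']
```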
arXiv Detail & Related papers (2020-05-01T15:58:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.