Taking an Emotional Look at Video Paragraph Captioning
- URL: http://arxiv.org/abs/2203.06356v1
- Date: Sat, 12 Mar 2022 06:19:48 GMT
- Title: Taking an Emotional Look at Video Paragraph Captioning
- Authors: Qinyu Li, Tengpeng Li, Hanli Wang, Chang Wen Chen
- Abstract summary: This work presents a comprehensive study of video paragraph captioning, with the goal of generating paragraph-level descriptions for a given video.
To solve this problem, we propose to construct a large-scale emotion and logic driven multilingual dataset for this task.
This dataset is named EMVPC and contains 53 widely-used emotions in daily life, 376 common scenes corresponding to these emotions, 10,291 high-quality videos and 20,582 elaborated paragraph captions with English and Chinese versions.
- Score: 38.98567869515748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Translating visual data into natural language is essential for machines to
understand the world and interact with humans. In this work, a comprehensive
study is conducted on video paragraph captioning, with the goal of generating
paragraph-level descriptions for a given video. However, current research
mainly focuses on detecting objective facts, ignoring the need to establish
logical associations between sentences and to discover more accurate emotions
related to video content. This problem impairs the fluency and richness of
predicted captions, which fall far below human language standards.
To solve this problem, we propose to construct a large-scale emotion and logic
driven multilingual dataset for this task. This dataset is named EMVPC
(standing for "Emotional Video Paragraph Captioning") and contains 53
widely-used emotions in daily life, 376 common scenes corresponding to these
emotions, 10,291 high-quality videos and 20,582 elaborated paragraph captions
with English and Chinese versions. Relevant emotion categories, scene labels,
emotion word labels and logic word labels are also provided in this new
dataset. The proposed EMVPC dataset intends to provide full-fledged video
paragraph captioning in terms of rich emotions, coherent logic and elaborate
expressions, which can also benefit other tasks in vision-language fields.
Furthermore, a comprehensive study is conducted through experiments on existing
benchmark video paragraph captioning datasets and the proposed EMVPC. The
state-of-the-art schemes from different visual captioning tasks are compared in
terms of 15 popular metrics, and their detailed objective as well as subjective
results are summarized. Finally, remaining problems and future directions of
video paragraph captioning are also discussed. The unique perspective of this
work is expected to boost further development in video paragraph captioning
research.
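The abstract describes each EMVPC video as carrying bilingual paragraph captions plus emotion, scene, emotion-word and logic-word labels. Below is a minimal sketch of how one such annotation record might be organized; the JSON layout and every field name are assumptions for illustration, not the actual EMVPC release format.

```python
# A minimal sketch of an EMVPC-style annotation record.
# The schema (field names, file layout) is an assumption for illustration;
# the actual EMVPC release may differ.
import json
from collections import Counter

# Hypothetical record: one video with bilingual paragraph captions and the
# emotion/scene/word-level labels described in the abstract.
record = {
    "video_id": "emvpc_000001",        # hypothetical identifier
    "emotion": "joy",                  # one of the 53 emotion categories
    "scene": "birthday party",         # one of the 376 scene labels
    "caption_en": "A girl blows out the candles, beaming with joy. "
                  "Then her friends cheer and hug her warmly.",
    "caption_zh": "女孩开心地吹灭蜡烛。随后朋友们欢呼并热情地拥抱她。",
    "emotion_words": ["beaming", "joy", "warmly"],  # emotion word labels
    "logic_words": ["then"],           # connectives marking sentence logic
}

def emotion_histogram(records):
    """Count videos per emotion category, e.g. to inspect dataset balance."""
    return Counter(r["emotion"] for r in records)

if __name__ == "__main__":
    print(json.dumps(record, ensure_ascii=False, indent=2))
    print(emotion_histogram([record]))
```

A loader built around records like this would make it straightforward to verify the statistics quoted above, such as coverage of the 53 emotion categories and 376 scene labels across the 10,291 videos.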
Related papers
- SPECTRUM: Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities [0.7510165488300369]
This paper proposes a novel Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities (SPECTRUM) framework.
SPECTRUM discerns multimodal semantics and emotional themes using Visual Text Attribute Investigation (VTAI) and determines the orientation of descriptive captions.
The framework exploits video-to-text retrieval capabilities and the multifaceted nature of video content to estimate the emotional probabilities of candidate captions.
arXiv Detail & Related papers (2024-11-04T10:51:47Z) - Generative Emotion Cause Explanation in Multimodal Conversations [23.39751445330256]
We propose a new task, Multimodal Conversation Emotion Cause Explanation (MCECE).
It aims to generate a detailed explanation of the emotional cause to the target utterance within a multimodal conversation scenario.
A novel approach, FAME-Net, is proposed that harnesses the power of Large Language Models (LLMs) to analyze visual data and accurately interpret the emotions conveyed through facial expressions in videos.
arXiv Detail & Related papers (2024-11-01T09:16:30Z) - Video Summarization: Towards Entity-Aware Captions [75.71891605682931]
We propose the task of summarizing news videos directly into entity-aware captions.
We show that our approach generalizes to an existing news image captioning dataset.
arXiv Detail & Related papers (2023-12-01T23:56:00Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset on persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z) - How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z) - Discourse Analysis for Evaluating Coherence in Video Paragraph Captions [99.37090317971312]
We explore a novel discourse-based framework to evaluate the coherence of video paragraphs.
Central to our approach is the discourse representation of videos, which helps in modeling coherence of paragraphs conditioned on coherence of videos.
Our experimental results show that the proposed framework evaluates the coherence of video paragraphs significantly better than all baseline methods.
arXiv Detail & Related papers (2022-01-17T04:23:08Z) - Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review [1.0520692160489133]
This review categorizes and describes the state-of-the-art techniques for the video-to-text problem.
It covers the main video-to-text methods and the ways to evaluate their performance.
State-of-the-art techniques are still a long way from achieving human-like performance in generating or retrieving video descriptions.
arXiv Detail & Related papers (2021-03-27T02:12:28Z) - Text Synopsis Generation for Egocentric Videos [72.52130695707008]
We propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video.
Users can read the short text to gain insight into the video and, more importantly, efficiently search the contents of a large video database.
arXiv Detail & Related papers (2020-05-08T00:28:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.