Towards Emotion Analysis in Short-form Videos: A Large-Scale Dataset and Baseline
- URL: http://arxiv.org/abs/2311.17335v2
- Date: Mon, 09 Dec 2024 13:55:27 GMT
- Title: Towards Emotion Analysis in Short-form Videos: A Large-Scale Dataset and Baseline
- Authors: Xuecheng Wu, Heli Sun, Junxiao Xue, Jiayu Nie, Xiangyan Kong, Ruofan Zhai, Liang He
- Abstract summary: The prevailing use of short-form videos (SVs) makes it necessary to conduct video emotion analysis (VEA) on SVs.
Given the lack of SV emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos.
We present an end-to-end audio-visual baseline, AV-CANet, which employs a video transformer to better learn semantically relevant representations.
- Score: 6.676841280436392
- License:
- Abstract: Nowadays, short-form videos (SVs) are essential to web information acquisition and sharing in our daily life. The prevailing use of SVs to spread emotions makes it necessary to conduct video emotion analysis (VEA) on SVs. Given the lack of SV emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos. Meanwhile, we alleviate the impact of subjectivity on labeling quality through careful personnel allocation and multi-stage annotation. In addition, we provide category-balanced and test-oriented variants through targeted data sampling. Some commonly studied videos, such as those of facial expressions, have been analyzed thoroughly; however, analyzing the emotions in SVs remains challenging, since their broader content diversity brings larger semantic gaps and greater difficulty in learning emotion-related features, and local biases and collective information gaps arise from emotion inconsistency under the prevalent audio-visual co-expression. To tackle these challenges, we present an end-to-end audio-visual baseline, AV-CANet, which employs a video transformer to better learn semantically relevant representations. We further design a Local-Global Fusion Module to progressively capture the correlations between audio and visual features. An EP-CE Loss is then introduced to guide model optimization. Extensive experimental results on seven datasets demonstrate the effectiveness of AV-CANet while providing broad insights for future work. In addition, we investigate the key components of AV-CANet through ablation studies. Datasets and code will be fully released soon.
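The abstract names the Local-Global Fusion Module only as a component that progressively captures audio-visual correlations, without specifying its internals. As a minimal sketch of what such a local-then-global fusion step could look like, the hypothetical PyTorch module below applies cross-attention from visual tokens to audio tokens (local) and then merges pooled clip-level summaries of both streams (global). The class name, dimensions, number of emotion classes, and fusion order are all illustrative assumptions, not the authors' design.

```python
import torch
import torch.nn as nn


class LocalGlobalFusion(nn.Module):
    """Hypothetical sketch of progressive audio-visual fusion.

    This is NOT the AV-CANet implementation; it only illustrates one plausible
    local-then-global fusion design consistent with the abstract's description.
    """

    def __init__(self, dim: int = 512, heads: int = 8, num_classes: int = 7):
        super().__init__()
        # Local step: visual tokens attend to temporally aligned audio tokens.
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Global step: merge pooled clip-level summaries of both streams.
        self.global_proj = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # num_classes is an assumption; the dataset's label set is not given here.
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, visual_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # visual_tokens, audio_tokens: (batch, tokens, dim)
        fused_local, _ = self.local_attn(visual_tokens, audio_tokens, audio_tokens)
        v_global = fused_local.mean(dim=1)   # pooled fused visual stream
        a_global = audio_tokens.mean(dim=1)  # pooled audio stream
        fused_global = self.global_proj(torch.cat([v_global, a_global], dim=-1))
        return self.classifier(fused_global)  # emotion logits


if __name__ == "__main__":
    # Toy usage with random features standing in for transformer outputs.
    model = LocalGlobalFusion()
    video = torch.randn(2, 16, 512)   # e.g. video-transformer tokens
    audio = torch.randn(2, 16, 512)   # e.g. audio spectrogram tokens
    print(model(video, audio).shape)  # torch.Size([2, 7])
```

A polarity-aware reweighting of the standard cross-entropy would be one way to realize what the abstract calls the EP-CE Loss, but since its definition is not given here, the sketch stops at plain emotion logits.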
Related papers
- Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data.
Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge.
We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
- Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding [25.4933695784155]
Understanding of video creativity and content often varies among individuals, with differences in focal points and cognitive levels across different ages, experiences, and genders.
To bridge the gap to real-world applications, we introduce a large-scale Subjective Response Indicators for Advertisement Videos dataset.
We developed tasks and protocols to analyze and evaluate the extent of cognitive understanding of video content among different users.
arXiv Detail & Related papers (2024-07-11T03:00:26Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key! [25.436683033432086]
Video saliency detection (VSD) aims at quickly locating the most attractive objects, things, or patterns in a given video clip.
This paper provides an extensive review to bridge the gap between audio-visual fusion and saliency detection.
arXiv Detail & Related papers (2022-06-20T07:25:13Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- Use of Affective Visual Information for Summarization of Human-Centric Videos [13.273989782771556]
We investigate the affective-information enriched supervised video summarization task for human-centric videos.
First, we train a visual input-driven state-of-the-art continuous emotion recognition model (CER-NET) on the RECOLA dataset to estimate emotional attributes.
Then, we integrate the estimated emotional attributes and the high-level representations from the CER-NET with the visual information to define the proposed affective video summarization architectures (AVSUM).
arXiv Detail & Related papers (2021-07-08T11:46:04Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)