Towards Emotion Analysis in Short-form Videos: A Large-Scale Dataset and Baseline
- URL: http://arxiv.org/abs/2311.17335v2
- Date: Mon, 09 Dec 2024 13:55:27 GMT
- Title: Towards Emotion Analysis in Short-form Videos: A Large-Scale Dataset and Baseline
- Authors: Xuecheng Wu, Heli Sun, Junxiao Xue, Jiayu Nie, Xiangyan Kong, Ruofan Zhai, Liang He
- Abstract summary: The prevailing use of short-form videos (SVs) makes it necessary to conduct video emotion analysis (VEA) on SVs.
Given the lack of SV emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos.
We present an end-to-end audio-visual baseline, AV-CANet, which employs a video transformer to better learn semantically relevant representations.
- Score: 6.676841280436392
- License:
- Abstract: Nowadays, short-form videos (SVs) are essential to web information acquisition and sharing in our daily life. The prevailing use of SVs to spread emotions makes it necessary to conduct video emotion analysis (VEA) on SVs. Given the lack of SV emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos. Meanwhile, we alleviate the impact of subjectivity on labeling quality through careful personnel allocation and multi-stage annotation. In addition, we provide category-balanced and test-oriented variants through targeted data sampling. Some commonly studied videos, such as those of facial expressions, have been analyzed thoroughly; however, analyzing the emotions in SVs remains challenging, since their broader content diversity brings larger semantic gaps and greater difficulty in learning emotion-related features, and local biases and collective information gaps arise from emotion inconsistency under the prevalent audio-visual co-expression. To tackle these challenges, we present an end-to-end audio-visual baseline, AV-CANet, which employs a video transformer to better learn semantically relevant representations. We further design a Local-Global Fusion Module to progressively capture the correlations between audio and visual features. An EP-CE Loss is then introduced to guide model optimization. Extensive experimental results on seven datasets demonstrate the effectiveness of AV-CANet while providing broad insights for future work. In addition, we investigate the key components of AV-CANet through ablation studies. Datasets and code will be fully released soon.
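The abstract names the Local-Global Fusion Module only as a component that progressively captures audio-visual correlations, without specifying its internals. As a minimal sketch of what such a local-then-global fusion step could look like, the hypothetical PyTorch module below applies cross-attention from visual tokens to audio tokens (local) and then merges pooled clip-level summaries of both streams (global). The class name, dimensions, number of emotion classes, and fusion order are all illustrative assumptions, not the authors' design.

```python
import torch
import torch.nn as nn


class LocalGlobalFusion(nn.Module):
    """Hypothetical sketch of progressive audio-visual fusion.

    This is NOT the AV-CANet implementation; it only illustrates one plausible
    local-then-global fusion design consistent with the abstract's description.
    """

    def __init__(self, dim: int = 512, heads: int = 8, num_classes: int = 7):
        super().__init__()
        # Local step: visual tokens attend to temporally aligned audio tokens.
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Global step: merge pooled clip-level summaries of both streams.
        self.global_proj = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # num_classes is an assumption; the dataset's label set is not given here.
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, visual_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # visual_tokens, audio_tokens: (batch, tokens, dim)
        fused_local, _ = self.local_attn(visual_tokens, audio_tokens, audio_tokens)
        v_global = fused_local.mean(dim=1)   # pooled fused visual stream
        a_global = audio_tokens.mean(dim=1)  # pooled audio stream
        fused_global = self.global_proj(torch.cat([v_global, a_global], dim=-1))
        return self.classifier(fused_global)  # emotion logits


if __name__ == "__main__":
    # Toy usage with random features standing in for transformer outputs.
    model = LocalGlobalFusion()
    video = torch.randn(2, 16, 512)   # e.g. video-transformer tokens
    audio = torch.randn(2, 16, 512)   # e.g. audio spectrogram tokens
    print(model(video, audio).shape)  # torch.Size([2, 7])
```

A polarity-aware reweighting of the standard cross-entropy would be one way to realize what the abstract calls the EP-CE Loss, but since its definition is not given here, the sketch stops at plain emotion logits.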
Related papers
- Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data.
Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge.
We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
- Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding [25.4933695784155]
Understanding of video creativity and content often varies among individuals, with differences in focal points and cognitive levels across different ages, experiences, and genders.
To bridge the gap to real-world applications, we introduce a large-scale Subjective Response Indicators for Advertisement Videos dataset.
We developed tasks and protocols to analyze and evaluate the extent of cognitive understanding of video content among different users.
arXiv Detail & Related papers (2024-07-11T03:00:26Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key! [25.436683033432086]
Video saliency detection (VSD) aims at quickly locating the most attractive objects, things, or patterns in a given video clip.
This paper provides an extensive review to bridge the gap between audio-visual fusion and saliency detection.
arXiv Detail & Related papers (2022-06-20T07:25:13Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- Use of Affective Visual Information for Summarization of Human-Centric Videos [13.273989782771556]
We investigate the affective-information enriched supervised video summarization task for human-centric videos.
First, we train a visual input-driven state-of-the-art continuous emotion recognition model (CER-NET) on the RECOLA dataset to estimate emotional attributes.
Then, we integrate the estimated emotional attributes and the high-level representations from the CER-NET with the visual information to define the proposed affective video summarization architectures (AVSUM).
arXiv Detail & Related papers (2021-07-08T11:46:04Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)