eMotions: A Large-Scale Dataset for Emotion Recognition in Short Videos
- URL: http://arxiv.org/abs/2311.17335v1
- Date: Wed, 29 Nov 2023 03:24:30 GMT
- Title: eMotions: A Large-Scale Dataset for Emotion Recognition in Short Videos
- Authors: Xuecheng Wu, Heli Sun, Junxiao Xue, Ruofan Zhai, Xiangyan Kong, Jiayu
Nie, Liang He
- Abstract summary: The prevailing use of short videos (SVs) leads to the necessity of emotion recognition in SVs.
Considering the lack of SV emotion data, we introduce a large-scale dataset named eMotions, comprising 27,996 videos.
We present an end-to-end baseline method AV-CPNet that employs the video transformer to better learn semantically relevant representations.
- Score: 7.011656298079659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, short videos (SVs) are essential to information acquisition and
sharing in our lives. The prevailing use of SVs to spread emotions leads to the
necessity of emotion recognition in SVs. Considering the lack of SV emotion
data, we introduce a large-scale dataset named eMotions, comprising 27,996
videos. Meanwhile, we alleviate the impact of subjectivities on labeling
quality by emphasizing better personnel allocations and multi-stage
annotations. In addition, we provide the category-balanced and test-oriented
variants through targeted data sampling. Videos of some commonly studied types
(e.g., facial expressions and postures) have been analyzed extensively, yet
understanding the emotions in SVs remains challenging: their greater content
diversity widens the semantic gap and makes emotion-related features harder to
learn, and the prevalent audio-visual co-expression introduces information gaps
when either modality conveys the emotion incompletely. To
tackle these problems, we present an end-to-end baseline method AV-CPNet that
employs the video transformer to better learn semantically relevant
representations. We further design the two-stage cross-modal fusion module to
complementarily model the correlations of audio-visual features. The EP-CE
Loss, incorporating three emotion polarities, is then applied to guide model
optimization. Extensive experimental results on nine datasets verify the
effectiveness of AV-CPNet. Datasets and code will be made available at
https://github.com/XuecWu/eMotions.
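The abstract names the EP-CE Loss only as incorporating three emotion polarities and does not give its form. As a minimal sketch, assuming the loss augments a standard fine-grained emotion cross-entropy with an auxiliary term over the three polarities (negative/neutral/positive), one way to write it in PyTorch is shown below; the class-to-polarity mapping, the log-sum-exp pooling of class logits, and the `polarity_weight` factor are illustrative assumptions, not the authors' formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionPolarityCrossEntropy(nn.Module):
    """Polarity-aware cross-entropy sketch: fine-grained emotion CE plus an
    auxiliary CE over three polarities (negative / neutral / positive).

    The class-to-polarity mapping, the log-sum-exp pooling, and the
    weighting factor are assumptions for illustration only.
    """

    def __init__(self, class_to_polarity: torch.Tensor, polarity_weight: float = 0.5):
        super().__init__()
        # class_to_polarity: LongTensor of shape (num_classes,) whose entries
        # are 0 (negative), 1 (neutral), or 2 (positive).
        self.register_buffer("class_to_polarity", class_to_polarity)
        self.polarity_weight = polarity_weight

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # Standard cross-entropy over the fine-grained emotion classes.
        ce_emotion = F.cross_entropy(logits, targets)

        # Pool the class logits that share a polarity via log-sum-exp,
        # yielding polarity logits of shape (batch, 3).
        polarity_logits = torch.stack(
            [
                torch.logsumexp(logits[:, self.class_to_polarity == p], dim=1)
                for p in range(3)
            ],
            dim=1,
        )
        polarity_targets = self.class_to_polarity[targets]
        ce_polarity = F.cross_entropy(polarity_logits, polarity_targets)

        return ce_emotion + self.polarity_weight * ce_polarity


# Example with 7 hypothetical emotion classes mapped onto the 3 polarities.
class_to_polarity = torch.tensor([0, 0, 0, 1, 2, 2, 2])
criterion = EmotionPolarityCrossEntropy(class_to_polarity, polarity_weight=0.5)
loss = criterion(torch.randn(8, 7), torch.randint(0, 7, (8,)))
```

Under this assumed formulation, a larger `polarity_weight` would push the model to get the coarse polarity right even when the fine-grained emotion class is ambiguous.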
Related papers
- Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding [25.4933695784155]
Understanding of video creativity and content often varies among individuals, with differences in focal points and cognitive levels across different ages, experiences, and genders.
To bridge the gap to real-world applications, we introduce a large-scale Subjective Response Indicators for Advertisement Videos dataset.
We developed tasks and protocols to analyze and evaluate the extent of cognitive understanding of video content among different users.
arXiv Detail & Related papers (2024-07-11T03:00:26Z) - VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z) - Data Augmentation for Emotion Detection in Small Imbalanced Text Data [0.0]
One of the challenges is the shortage of available datasets that have been annotated with emotions.
We studied the impact of data augmentation techniques precisely when applied to small imbalanced datasets.
Our experimental results show that using the augmented data when training the classifier model leads to significant improvements.
arXiv Detail & Related papers (2023-10-25T21:29:36Z) - Efficient Labelling of Affective Video Datasets via Few-Shot &
Multi-Task Contrastive Learning [5.235294751659532]
We propose Multi-Task Contrastive Learning for Affect Representation (MT-CLAR) for few-shot affect inference.
MT-CLAR combines multi-task learning with a Siamese network trained via contrastive learning to infer from a pair of expressive facial images.
We extend the image-based MT-CLAR framework for automated video labelling.
arXiv Detail & Related papers (2023-08-04T07:19:08Z) - Disentangled Variational Autoencoder for Emotion Recognition in
Conversations [14.92924920489251]
We propose a VAD-disentangled Variational AutoEncoder (VAD-VAE) for Emotion Recognition in Conversations (ERC).
VAD-VAE disentangles three affect representations Valence-Arousal-Dominance (VAD) from the latent space.
Experiments show that VAD-VAE outperforms the state-of-the-art model on two datasets.
arXiv Detail & Related papers (2023-05-23T13:50:06Z) - Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot
Learning [74.48337375174297]
Generalized Zero-Shot Learning (GZSL) identifies unseen categories by knowledge transferred from the seen domain.
We deploy the dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between prototypes and visual features.
DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes to adapt to different images, enabling the recast of the unmatched semantic-visual pair into the matched one.
arXiv Detail & Related papers (2023-03-27T15:21:43Z) - UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z) - How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z) - Emotional Semantics-Preserved and Feature-Aligned CycleGAN for Visual
Emotion Adaptation [85.20533077846606]
Unsupervised domain adaptation (UDA) studies the problem of transferring models trained on one labeled source domain to another unlabeled target domain.
In this paper, we focus on UDA in visual emotion analysis for both emotion distribution learning and dominant emotion classification.
We propose a novel end-to-end cycle-consistent adversarial model, termed CycleEmotionGAN++.
arXiv Detail & Related papers (2020-11-25T01:31:01Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.