AttendAffectNet: Self-Attention based Networks for Predicting Affective
Responses from Movies
- URL: http://arxiv.org/abs/2010.11188v1
- Date: Wed, 21 Oct 2020 05:13:24 GMT
- Title: AttendAffectNet: Self-Attention based Networks for Predicting Affective
Responses from Movies
- Authors: Ha Thi Phuong Thao, Balamurali B.T., Dorien Herremans and Gemma Roig
- Abstract summary: We propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet.
We take both audio and video into account and incorporate the relation among multiple modalities by applying the self-attention mechanism in a novel manner to the extracted features for emotion prediction.
Our results show that applying the self-attention mechanism on the different audio-visual features, rather than in the time domain, is more effective for emotion prediction.
- Score: 16.45955178108593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose different variants of the self-attention based
network for emotion prediction from movies, which we call AttendAffectNet. We
take both audio and video into account and incorporate the relation among
multiple modalities by applying the self-attention mechanism in a novel manner
to the extracted features for emotion prediction. We compare it to the typical
temporal integration of the self-attention-based model, which, in our case,
captures the relation among temporal representations of the movie while
considering the sequential dependencies of emotion responses. We demonstrate
the effectiveness of our proposed architectures on the extended COGNIMUSE
dataset [1], [2] and the MediaEval 2016 Emotional Impact of Movies Task [3],
which consist of movies with emotion annotations. Our results show that
applying the self-attention mechanism on the different audio-visual features,
rather than in the time domain, is more effective for emotion prediction. Our
approach is also shown to outperform many state-of-the-art models for emotion
prediction. The code to reproduce our results, with the models' implementation,
is available at: https://github.com/ivyha010/AttendAffectNet.
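To make the core idea concrete, below is a minimal sketch, not the authors' implementation: it illustrates self-attention applied across per-modality feature vectors rather than across time, with each extracted audio or visual feature vector treated as one token. The feature dimensions, number of modalities, model size, and the valence/arousal regression head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalitySelfAttention(nn.Module):
    """Illustrative sketch: self-attention over per-modality feature vectors."""
    def __init__(self, feature_dims, d_model=128, num_heads=4):
        super().__init__()
        # One linear projection per modality, mapping to a shared dimension.
        self.projections = nn.ModuleList([nn.Linear(d, d_model) for d in feature_dims])
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.head = nn.Linear(d_model, 2)  # e.g. valence and arousal (assumed output)

    def forward(self, features):
        # features: list of tensors, one per modality, each of shape (batch, feature_dim)
        tokens = torch.stack([p(f) for p, f in zip(self.projections, features)], dim=1)
        attended, _ = self.attention(tokens, tokens, tokens)  # attend across modalities
        return self.head(attended.mean(dim=1))  # pool over modalities, predict emotion

# Hypothetical usage with three pre-extracted feature sets (e.g. video, audio, motion).
model = ModalitySelfAttention(feature_dims=[2048, 1582, 1024])
video, audio, motion = torch.randn(4, 2048), torch.randn(4, 1582), torch.randn(4, 1024)
print(model([video, audio, motion]).shape)  # torch.Size([4, 2])
```

In the temporal variant mentioned in the abstract, the tokens would instead be per-time-step representations of the movie, with the same attention machinery applied along the time axis.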
Related papers
- Enhancing the Prediction of Emotional Experience in Movies using Deep
Neural Networks: The Significance of Audio and Language [0.0]
Our paper focuses on using deep neural network models to accurately predict the range of human emotions experienced while watching movies.
In this setup, there are three clear-cut input modalities that considerably influence the experienced emotions: visual cues derived from RGB video frames; auditory components encompassing sounds, speech, and music; and linguistic elements encompassing actors' dialogues.
arXiv Detail & Related papers (2023-06-17T17:40:27Z) - Multimodal Feature Extraction and Attention-based Fusion for Emotion
Estimation in Videos [16.28109151595872]
We introduce our submission to the CVPR 2023 Competition on Affective Behavior Analysis in-the-wild (ABAW).
We exploited multimodal features extracted from video of different lengths from the competition dataset, including audio, pose and images.
Our system achieves the performance of 0.361 on the validation dataset.
arXiv Detail & Related papers (2023-03-18T14:08:06Z) - Dilated Context Integrated Network with Cross-Modal Consensus for
Temporal Emotion Localization in Videos [128.70585652795637]
Temporal emotion localization (TEL) presents three unique challenges compared to temporal action localization.
The emotions have extremely varied temporal dynamics.
The fine-grained temporal annotations are complicated and labor-intensive.
arXiv Detail & Related papers (2022-08-03T10:00:49Z) - Seeking Subjectivity in Visual Emotion Distribution Learning [93.96205258496697]
Visual Emotion Analysis (VEA) aims to predict people's emotions towards different visual stimuli.
Existing methods often predict visual emotion distribution in a unified network, neglecting the inherent subjectivity in its crowd voting process.
We propose a novel Subjectivity Appraise-and-Match Network (SAMNet) to investigate the subjectivity in visual emotion distribution.
arXiv Detail & Related papers (2022-07-25T02:20:03Z) - SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network [83.27291945217424]
We propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images.
To mine the emotional relationships between distinct objects, we first build up an Emotion Graph based on semantic concepts and visual features.
We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion process of object features with the proposed scene-based attention mechanism.
arXiv Detail & Related papers (2021-10-24T02:41:41Z) - Affect2MM: Affective Analysis of Multimedia Content Using Emotion
Causality [84.69595956853908]
We present Affect2MM, a learning method for time-series emotion prediction for multimedia content.
Our goal is to automatically capture the varying emotions depicted by characters in real-life human-centric situations and behaviors.
arXiv Detail & Related papers (2021-03-11T09:07:25Z) - Modality-Transferable Emotion Embeddings for Low-Resource Multimodal
Emotion Recognition [55.44502358463217]
We propose a modality-transferable model with emotion embeddings to tackle the aforementioned issues.
Our model achieves state-of-the-art performance on most of the emotion categories.
Our model also outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions.
arXiv Detail & Related papers (2020-09-21T06:10:39Z) - Emotional Video to Audio Transformation Using Deep Recurrent Neural
Networks and a Neuro-Fuzzy System [8.900866276512364]
Current approaches overlook the video's emotional characteristics in the music generation step.
We propose a novel hybrid deep neural network that uses an Adaptive Neuro-Fuzzy Inference System to predict a video's emotion.
Our model can effectively generate audio that matches the scene and elicits a similar emotion from the viewer on both datasets.
arXiv Detail & Related papers (2020-04-05T07:18:28Z) - An End-to-End Visual-Audio Attention Network for Emotion Recognition in
User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)