An End-to-End Visual-Audio Attention Network for Emotion Recognition in
User-Generated Videos
- URL: http://arxiv.org/abs/2003.00832v1
- Date: Wed, 12 Feb 2020 15:33:59 GMT
- Title: An End-to-End Visual-Audio Attention Network for Emotion Recognition in
User-Generated Videos
- Authors: Sicheng Zhao, Yunsheng Ma, Yang Gu, Jufeng Yang, Tengfei Xing, Pengfei
Xu, Runbo Hu, Hua Chai, Kurt Keutzer
- Abstract summary: We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
- Score: 64.91614454412257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotion recognition in user-generated videos plays an important role in
human-centered computing. Existing methods mainly employ a traditional two-stage
shallow pipeline, i.e. extracting visual and/or audio features and training
classifiers. In this paper, we propose to recognize video emotions in an
end-to-end manner based on convolutional neural networks (CNNs). Specifically,
we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture
that integrates spatial, channel-wise, and temporal attentions into a visual 3D
CNN and temporal attentions into an audio 2D CNN. Further, we design a special
classification loss, i.e. polarity-consistent cross-entropy loss, based on the
polarity-emotion hierarchy constraint to guide the attention generation.
Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6
datasets demonstrate that the proposed VAANet outperforms the state-of-the-art
approaches for video emotion recognition. Our source code is released at:
https://github.com/maysonma/VAANet.
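The polarity-consistent cross-entropy (PCCE) loss is only described at a high level in the abstract. The following is a minimal PyTorch-style sketch, assuming the loss up-weights the standard cross-entropy whenever the arg-max prediction's polarity (positive vs. negative) disagrees with the polarity of the ground-truth label; the emotion-to-polarity mapping, the penalty weight lambda_pc, and the function name pcce_loss are illustrative assumptions rather than the authors' released implementation (see the repository above for the actual code).

```python
import torch
import torch.nn.functional as F

# Hypothetical polarity grouping for 8 emotion classes
# (0 = positive polarity, 1 = negative polarity); an assumption, not the paper's mapping.
EMOTION_POLARITY = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])


def pcce_loss(logits, targets, lambda_pc=0.5):
    """Cross-entropy plus a penalty whenever the arg-max prediction's
    polarity differs from the ground-truth polarity (assumed form)."""
    ce = F.cross_entropy(logits, targets, reduction="none")      # per-sample cross-entropy
    pred = logits.argmax(dim=1)                                  # predicted emotion class
    polarity = EMOTION_POLARITY.to(logits.device)
    mismatch = (polarity[pred] != polarity[targets]).float()     # 1 if polarity disagrees
    return ((1.0 + lambda_pc * mismatch) * ce).mean()            # polarity-weighted CE


# Usage: logits would come from the fused visual-audio head, targets are emotion labels.
logits = torch.randn(4, 8, requires_grad=True)
targets = torch.tensor([0, 3, 5, 7])
loss = pcce_loss(logits, targets)
loss.backward()
```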
Related papers
- Egocentric Audio-Visual Object Localization [51.434212424829525]
We propose a geometry-aware temporal aggregation module to handle the egomotion explicitly.
The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations.
It improves cross-modal localization robustness by disentangling the visually indicated audio representation.
arXiv Detail & Related papers (2023-03-23T17:43:11Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z)
- Visual Attention Network [90.0753726786985]
We propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention.
We also introduce a novel neural network based on LKA, namely the Visual Attention Network (VAN).
VAN outperforms state-of-the-art vision transformers and convolutional neural networks by a large margin in extensive experiments (a minimal LKA sketch appears after this list).
arXiv Detail & Related papers (2022-02-20T06:35:18Z)
- Emotion recognition in talking-face videos using persistent entropy and neural networks [0.5156484100374059]
We use persistent entropy and neural networks as the main tools to recognise and classify emotions from talking-face videos.
We prove that small changes in the video produce small changes in the resulting topological signature.
These topological signatures are used to feed a neural network to distinguish between the following emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised (see the persistent-entropy sketch after this list).
arXiv Detail & Related papers (2021-10-26T11:08:56Z)
- Continuous Emotion Recognition with Spatiotemporal Convolutional Neural Networks [82.54695985117783]
We investigate the suitability of state-of-the-art deep learning architectures for continuous emotion recognition using long video sequences captured in-the-wild.
We have developed and evaluated convolutional recurrent neural networks combining 2D-CNNs and long short-term memory (LSTM) units, and inflated 3D-CNN models, which are built by inflating the weights of a pre-trained 2D-CNN model during fine-tuning.
arXiv Detail & Related papers (2020-11-18T13:42:05Z)
- Emotion Recognition in Audio and Video Using Deep Neural Networks [9.694548197876868]
With the advancement of deep learning technology, there has been significant improvement in speech recognition.
Recognizing emotion from speech is an important aspect, and with deep learning technology emotion recognition has improved in both accuracy and latency.
In this work, we explore different neural networks to improve the accuracy of emotion recognition.
arXiv Detail & Related papers (2020-06-15T04:50:18Z)
- Emotional Video to Audio Transformation Using Deep Recurrent Neural Networks and a Neuro-Fuzzy System [8.900866276512364]
Current approaches overlook the video's emotional characteristics in the music generation step.
We propose a novel hybrid deep neural network that uses an Adaptive Neuro-Fuzzy Inference System to predict a video's emotion.
Our model can effectively generate audio that matches the scene, eliciting a similar emotion from the viewer, on both datasets.
arXiv Detail & Related papers (2020-04-05T07:18:28Z)
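For the Visual Attention Network entry above, large kernel attention (LKA) decomposes a large-kernel convolution into a depth-wise convolution, a depth-wise dilated convolution, and a pointwise (1x1) convolution, and uses the output as an attention map that gates the input. The sketch below is a minimal PyTorch version assuming the commonly cited 5x5 / 7x7 (dilation 3) / 1x1 configuration that approximates a 21x21 kernel; the exact kernel sizes in the released VAN code may differ.

```python
import torch
import torch.nn as nn


class LKA(nn.Module):
    """Large kernel attention: approximate a 21x21 convolution with a
    5x5 depth-wise conv, a 7x7 depth-wise dilated conv (dilation 3),
    and a 1x1 conv, then gate the input with the result."""

    def __init__(self, dim):
        super().__init__()
        self.dw_conv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.pointwise = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.pointwise(self.dw_dilated(self.dw_conv(x)))
        return x * attn  # element-wise attention over the input features


# Usage: output has the same shape as the input, e.g. LKA(64)(torch.randn(1, 64, 56, 56))
```

For the persistent-entropy entry, persistent entropy is the Shannon entropy of the normalized interval lifetimes of a persistence barcode: with lifetimes l_i = d_i - b_i and L = sum_i l_i, PE = -sum_i (l_i / L) log(l_i / L). The sketch below computes this quantity with NumPy; how the paper derives barcodes from talking-face videos (signal construction and filtration) is not reproduced here.

```python
import numpy as np


def persistent_entropy(barcode):
    """Shannon entropy of the normalized lifetimes of a persistence
    barcode given as (birth, death) pairs."""
    barcode = np.asarray(barcode, dtype=float)
    lifetimes = barcode[:, 1] - barcode[:, 0]
    lifetimes = lifetimes[lifetimes > 0]      # ignore zero-length intervals
    p = lifetimes / lifetimes.sum()           # normalize to a probability vector
    return float(-(p * np.log(p)).sum())


# Example: a toy barcode with three finite intervals.
print(persistent_entropy([(0.0, 1.0), (0.2, 0.9), (0.5, 0.6)]))
```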