Emotional Video to Audio Transformation Using Deep Recurrent Neural
Networks and a Neuro-Fuzzy System
- URL: http://arxiv.org/abs/2004.02113v1
- Date: Sun, 5 Apr 2020 07:18:28 GMT
- Title: Emotional Video to Audio Transformation Using Deep Recurrent Neural
Networks and a Neuro-Fuzzy System
- Authors: Gwenaelle Cunha Sergio and Minho Lee
- Abstract summary: Current approaches overlook the video's emotional characteristics in the music generation step.
We propose a novel hybrid deep neural network that uses an Adaptive Neuro-Fuzzy Inference System to predict a video's emotion.
Our model can effectively generate audio that matches the scene, eliciting a similar emotion from the viewer, on both datasets.
- Score: 8.900866276512364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating music with emotion similar to that of an input video is a very
relevant issue nowadays. Video content creators and automatic movie directors
benefit from maintaining their viewers engaged, which can be facilitated by
producing novel material eliciting stronger emotions in them. Moreover, there's
currently a demand for more empathetic computers to aid humans in applications
such as augmenting the perception ability of visually and/or hearing impaired
people. Current approaches overlook the video's emotional characteristics in
the music generation step, only consider static images instead of videos, are
unable to generate novel music, and require a high level of human effort and
skills. In this study, we propose a novel hybrid deep neural network that uses
an Adaptive Neuro-Fuzzy Inference System to predict a video's emotion from its
visual features and a deep Long Short-Term Memory Recurrent Neural Network to
generate its corresponding audio signals with a similar emotional undertone. The
former is able to appropriately model emotions due to its fuzzy properties, and
the latter is able to model data with dynamic time properties well due to the
availability of the previous hidden state information. The novelty of our
proposed method lies in the extraction of visual emotional features in order to
transform them into audio signals with corresponding emotional aspects for
users. Quantitative experiments show low mean absolute errors of 0.217 and
0.255 on the Lindsey and DEAP datasets, respectively, and similar global
features in the spectrograms. This indicates that our model is able to
appropriately perform domain transformation between visual and audio features.
Based on experimental results, our model can effectively generate audio that
matches the scene, eliciting a similar emotion from the viewer, on both datasets,
and music generated by our model is also chosen more often.
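The two-stage design the abstract describes, an ANFIS-style fuzzy stage that maps visual features to an emotion estimate and a deep LSTM that generates audio features conditioned on it, can be sketched roughly as below. This is a minimal illustration under assumptions, not the authors' released code: the zeroth-order Takagi-Sugeno fuzzy layer, the module names, the feature dimensions, and the way the emotion score conditions the decoder are all made up for the example.

```python
# Hypothetical sketch of the ANFIS + LSTM pipeline described in the abstract.
import torch
import torch.nn as nn

class TinyANFIS(nn.Module):
    """Zeroth-order Takagi-Sugeno fuzzy layer with Gaussian memberships."""
    def __init__(self, in_dim, fuzzy_dim=4, n_rules=8):
        super().__init__()
        self.reduce = nn.Linear(in_dim, fuzzy_dim)            # keep the fuzzy space small
        self.centers = nn.Parameter(torch.randn(n_rules, fuzzy_dim))
        self.log_sigma = nn.Parameter(torch.zeros(n_rules, fuzzy_dim))
        self.rule_out = nn.Parameter(torch.randn(n_rules))    # one consequent per rule

    def forward(self, x):                                     # x: (batch, in_dim)
        z = self.reduce(x)
        diff = z.unsqueeze(1) - self.centers                  # (batch, rules, fuzzy_dim)
        # Rule firing strength = product of Gaussian memberships over inputs
        firing = torch.exp(-0.5 * (diff / self.log_sigma.exp()) ** 2).prod(dim=-1)
        weights = firing / (firing.sum(dim=1, keepdim=True) + 1e-8)
        return (weights * self.rule_out).sum(dim=1, keepdim=True)   # (batch, 1) emotion

class AudioDecoderLSTM(nn.Module):
    """Emits a sequence of audio feature frames conditioned on the emotion estimate."""
    def __init__(self, vis_dim, audio_dim, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(vis_dim + 1, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, audio_dim)

    def forward(self, vis_seq, emotion):                      # vis_seq: (batch, T, vis_dim)
        cond = emotion.unsqueeze(1).expand(-1, vis_seq.size(1), -1)
        out, _ = self.lstm(torch.cat([vis_seq, cond], dim=-1))
        return self.proj(out)                                 # (batch, T, audio_dim)

vis_seq = torch.randn(4, 30, 128)                 # 30 video frames of 128-d visual features
emotion = TinyANFIS(128)(vis_seq.mean(dim=1))     # clip-level emotion estimate
audio_feats = AudioDecoderLSTM(128, 64)(vis_seq, emotion)
print(audio_feats.shape)                          # torch.Size([4, 30, 64])
```

In this sketch the LSTM outputs audio feature frames (for example spectrogram columns) rather than a raw waveform; the actual output representation, fuzzy rule base, and training objective are specified in the paper itself.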
Related papers
- Audio-Driven Emotional 3D Talking-Head Generation [47.6666060652434]
We present a novel system for synthesizing high-fidelity, audio-driven video portraits with accurate emotional expressions.
We propose a pose sampling method that generates natural idle-state (non-speaking) videos in response to silent audio inputs.
arXiv Detail & Related papers (2024-10-07T08:23:05Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- Music Emotion Prediction Using Recurrent Neural Networks [8.867897390286815]
This study aims to enhance music recommendation systems and support therapeutic interventions by tailoring music to fit listeners' emotional states.
We utilize Russell's Emotion Quadrant to categorize music into four distinct emotional regions and develop models capable of accurately predicting these categories.
Our approach involves extracting a comprehensive set of audio features using Librosa and applying various recurrent neural network architectures, including standard RNNs, Bidirectional RNNs, and Long Short-Term Memory (LSTM) networks.
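A rough sketch of that recipe, assuming nothing beyond what the summary states: Librosa features (MFCCs and chroma here, which are common choices rather than a confirmed feature set) feed a bidirectional LSTM that predicts one of the four Russell quadrants. The file path, feature mix, and layer sizes are hypothetical.

```python
# Illustrative only: Librosa feature extraction + a quadrant classifier.
import librosa
import numpy as np
import torch
import torch.nn as nn

def extract_features(path, sr=22050, n_mfcc=20):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # (12, frames)
    return np.concatenate([mfcc, chroma], axis=0).T           # (frames, n_mfcc + 12)

class QuadrantLSTM(nn.Module):
    def __init__(self, feat_dim=32, hidden=128, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                   # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])        # logits over the four Russell quadrants

# feats = extract_features("song.wav")     # hypothetical audio file
# logits = QuadrantLSTM()(torch.tensor(feats[None], dtype=torch.float32))
```

Training with cross-entropy over labeled quadrants would complete the picture; the summary does not say which feature set or recurrent depth the authors settled on.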
arXiv Detail & Related papers (2024-05-10T18:03:20Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- Emotion recognition in talking-face videos using persistent entropy and neural networks [0.5156484100374059]
We use persistent entropy and neural networks as main tools to recognise and classify emotions from talking-face videos.
We prove that small changes in the video produce small changes in the signature.
These topological signatures are used to feed a neural network to distinguish between the following emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised.
arXiv Detail & Related papers (2021-10-26T11:08:56Z)
- AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video.
We use neural scene representation networks to bridge the gap between audio input and video output.
Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)
- An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
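As a loose illustration of that two-stream idea (not the VAANet implementation), the sketch below runs a small visual 3D CNN and an audio 2D CNN over a clip and its spectrogram, pools each stream with temporal attention, and fuses them for emotion classification. The spatial and channel-wise attentions of the actual model are omitted, and every layer size here is invented.

```python
# Simplified two-stream emotion classifier with temporal attention pooling.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                          # x: (batch, T, dim)
        w = torch.softmax(self.score(x), dim=1)    # attention weights over time
        return (w * x).sum(dim=1)                  # (batch, dim)

class TwoStreamEmotionNet(nn.Module):
    def __init__(self, n_classes=8):
        super().__init__()
        self.visual = nn.Sequential(nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(),
                                    nn.AdaptiveAvgPool3d((8, 1, 1)))   # keep 8 time steps
        self.audio = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d((8, 1)))       # keep 8 time steps
        self.att_v, self.att_a = TemporalAttention(16), TemporalAttention(16)
        self.head = nn.Linear(32, n_classes)

    def forward(self, frames, spec):   # frames: (B, 3, T, H, W), spec: (B, 1, T, F)
        v = self.visual(frames).flatten(2).transpose(1, 2)    # (B, 8, 16)
        a = self.audio(spec).flatten(2).transpose(1, 2)       # (B, 8, 16)
        return self.head(torch.cat([self.att_v(v), self.att_a(a)], dim=-1))

logits = TwoStreamEmotionNet()(torch.randn(2, 3, 16, 64, 64), torch.randn(2, 1, 16, 128))
print(logits.shape)                    # torch.Size([2, 8])
```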
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state of the art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)