Emotional Video to Audio Transformation Using Deep Recurrent Neural
Networks and a Neuro-Fuzzy System
- URL: http://arxiv.org/abs/2004.02113v1
- Date: Sun, 5 Apr 2020 07:18:28 GMT
- Title: Emotional Video to Audio Transformation Using Deep Recurrent Neural
Networks and a Neuro-Fuzzy System
- Authors: Gwenaelle Cunha Sergio and Minho Lee
- Abstract summary: Current approaches overlook the video's emotional characteristics in the music generation step.
We propose a novel hybrid deep neural network that uses an Adaptive Neuro-Fuzzy Inference System to predict a video's emotion.
Our model can effectively generate audio that matches the scene, eliciting a similar emotion from the viewer, in both datasets.
- Score: 8.900866276512364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating music whose emotion matches that of an input video is a
highly relevant problem today. Video content creators and automatic movie
directors benefit from keeping their viewers engaged, which can be facilitated
by producing novel material that elicits stronger emotions in them. Moreover,
there is currently demand for more empathetic computers to aid humans in
applications such as augmenting the perception of visually and/or hearing-impaired
people. Current approaches overlook the video's emotional characteristics in
the music generation step, only consider static images instead of videos, are
unable to generate novel music, and require a high level of human effort and
skills. In this study, we propose a novel hybrid deep neural network that uses
an Adaptive Neuro-Fuzzy Inference System to predict a video's emotion from its
visual features and a deep Long Short-Term Memory Recurrent Neural Network to
generate corresponding audio signals with a similar emotional character. The
former models emotions appropriately due to its fuzzy properties, and the
latter models temporally dynamic data well because each step has access to the
previous hidden state. The novelty of our
proposed method lies in the extraction of visual emotional features in order to
transform them into audio signals with corresponding emotional aspects for
users. Quantitative experiments show low mean absolute errors of 0.217 and
0.255 on the Lindsey and DEAP datasets, respectively, and similar global
features in the spectrograms. This indicates that our model is able to
appropriately perform domain transformation between visual and audio features.
Based on these experimental results, our model can effectively generate audio
that matches the scene, eliciting a similar emotion from the viewer, in both
datasets, and the music generated by our model is also chosen more often by
listeners.
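The two-stage design described above (a fuzzy inference step for emotion, a recurrent network for audio) can be sketched with a toy zero-order Takagi-Sugeno fuzzy rule base. The brightness feature, membership shapes, and rule consequents below are illustrative assumptions, not the paper's actual configuration; an ANFIS would learn these parameters from data rather than fix them by hand.

```python
# Toy sketch of fuzzy emotion inference from a visual feature.
# Feature ("brightness") and rule outputs are illustrative only.

def tri(x, a, b, c):
    """Triangular fuzzy membership function on [a, c], peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_valence(brightness):
    """Zero-order Takagi-Sugeno inference with two illustrative rules:
    IF brightness is dark   THEN valence = 0.2
    IF brightness is bright THEN valence = 0.8
    """
    w_dark = tri(brightness, -0.5, 0.0, 1.0)
    w_bright = tri(brightness, 0.0, 1.0, 1.5)
    num = w_dark * 0.2 + w_bright * 0.8
    den = w_dark + w_bright
    return num / den if den else 0.5  # neutral fallback if no rule fires

print(fuzzy_valence(0.0))  # fully "dark"   -> 0.2
print(fuzzy_valence(1.0))  # fully "bright" -> 0.8
print(fuzzy_valence(0.5))  # mixed          -> 0.5
```

In the paper's pipeline, a score like this would then condition the LSTM that generates the audio sequence; here the fuzzy step alone is shown because it is the part that benefits most from a small worked example.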
Related papers
- Emotion Manipulation Through Music -- A Deep Learning Interactive Visual Approach [0.0]
We introduce a novel way to manipulate the emotional content of a song using AI tools.
Our goal is to achieve the desired emotion while leaving the original melody as intact as possible.
This research may contribute to on-demand custom music generation, the automated remixing of existing work, and music playlists tuned for emotional progression.
arXiv Detail & Related papers (2024-06-12T20:12:29Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- Music Emotion Prediction Using Recurrent Neural Networks [8.867897390286815]
This study aims to enhance music recommendation systems and support therapeutic interventions by tailoring music to fit listeners' emotional states.
We utilize Russell's Emotion Quadrant to categorize music into four distinct emotional regions and develop models capable of accurately predicting these categories.
Our approach involves extracting a comprehensive set of audio features using Librosa and applying various recurrent neural network architectures, including standard RNNs, Bidirectional RNNs, and Long Short-Term Memory (LSTM) networks.
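Russell's quadrant scheme mentioned above maps a (valence, arousal) prediction onto four emotional regions. A minimal sketch of that mapping, assuming valence and arousal are centred in [-1, 1] (the thresholds and quadrant labels are the conventional ones, not taken from this paper's code):

```python
# Map a (valence, arousal) pair to one of Russell's four emotion quadrants.
# Assumes both coordinates are centred so that 0 separates the quadrants.

def russell_quadrant(valence, arousal):
    if valence >= 0 and arousal >= 0:
        return "happy/excited"   # Q1: +valence, +arousal
    if valence < 0 and arousal >= 0:
        return "angry/tense"     # Q2: -valence, +arousal
    if valence < 0:
        return "sad/depressed"   # Q3: -valence, -arousal
    return "calm/content"        # Q4: +valence, -arousal

print(russell_quadrant(0.7, 0.6))    # happy/excited
print(russell_quadrant(-0.4, 0.8))   # angry/tense
```

An RNN trained as in the paper would output the valence/arousal (or the category directly); this post-processing step is what turns continuous predictions into the four-class labels.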
arXiv Detail & Related papers (2024-05-10T18:03:20Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- Emotion recognition in talking-face videos using persistent entropy and neural networks [0.5156484100374059]
We use persistent entropy and neural networks as the main tools to recognise and classify emotions from talking-face videos.
We prove that small changes in the video produce small changes in the signature.
These topological signatures are used to feed a neural network to distinguish between the following emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised.
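Persistent entropy has a standard definition: the Shannon entropy of the normalised bar lengths of a persistence diagram. A minimal sketch of that computation (not the paper's code; the input diagram below is made up for illustration):

```python
import math

# Persistent entropy of a persistence diagram: Shannon entropy of the
# bar lengths normalised to sum to 1. Small perturbations of the diagram
# change the bar lengths only slightly, which is why such signatures are
# stable under small changes in the input video.

def persistent_entropy(bars):
    """bars: list of (birth, death) pairs with death > birth."""
    lengths = [d - b for b, d in bars]
    total = sum(lengths)
    probs = [l / total for l in lengths]
    return -sum(p * math.log(p) for p in probs)

# A diagram with two equal-length bars has maximal entropy log(2).
print(persistent_entropy([(0.0, 1.0), (0.5, 1.5)]))  # ~0.6931
```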
arXiv Detail & Related papers (2021-10-26T11:08:56Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis [55.24336227884039]
We present a novel framework to generate high-fidelity talking head video.
We use neural scene representation networks to bridge the gap between audio input and video output.
Our framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
arXiv Detail & Related papers (2021-03-20T02:58:13Z)
- An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
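The temporal-attention idea used in architectures like VAANet can be sketched in a few lines: per-frame scores are softmax-normalised and used to take a weighted average of the frame features. This is an illustrative sketch, not VAANet's actual implementation; in practice the query vector and scoring function are learned, and the features come from a CNN rather than being hand-written as below.

```python
import math

# Minimal temporal-attention pooling over per-frame feature vectors.
# Scores come from a dot product with a (here fixed, illustrative) query.

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_attention(frame_feats, query):
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    pooled = [sum(w * feat[i] for w, feat in zip(weights, frame_feats))
              for i in range(dim)]
    return pooled, weights

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = temporal_attention(feats, query=[1.0, 1.0])
print(weights)  # the [1, 1] frame gets the largest weight
```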
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.