StrengthNet: Deep Learning-based Emotion Strength Assessment for
Emotional Speech Synthesis
- URL: http://arxiv.org/abs/2110.03156v2
- Date: Fri, 8 Oct 2021 03:28:23 GMT
- Title: StrengthNet: Deep Learning-based Emotion Strength Assessment for
Emotional Speech Synthesis
- Authors: Rui Liu, Berrak Sisman, Haizhou Li
- Abstract summary: We propose a deep learning-based emotion strength assessment network for strength prediction, referred to as StrengthNet.
Our model conforms to a multi-task learning framework with a structure that includes an acoustic encoder, a strength predictor and an auxiliary emotion predictor.
Experiments show that the emotion strength predicted by StrengthNet is highly correlated with ground-truth scores for seen and unseen speech.
- Score: 82.39099867188547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, emotional speech synthesis has achieved remarkable performance. The
emotion strength of synthesized speech can be controlled flexibly using a
strength descriptor, which is obtained by an emotion attribute ranking
function. However, a ranking function trained on specific data generalizes
poorly, which limits its applicability to more realistic cases. In this paper,
we propose a deep learning-based emotion strength assessment network for
strength prediction, referred to as StrengthNet. Our model
conforms to a multi-task learning framework with a structure that includes an
acoustic encoder, a strength predictor and an auxiliary emotion predictor. A
data augmentation strategy is employed to improve model generalization.
Experiments show that the emotion strength predicted by StrengthNet is highly
correlated with ground-truth scores for both seen and unseen speech. Our code
is available at: https://github.com/ttslr/StrengthNet.
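The authors' implementation is in the repository above; the following is only a minimal PyTorch sketch of the multi-task structure named in the abstract (acoustic encoder, strength predictor, auxiliary emotion predictor). The BiLSTM encoder, layer sizes, pooling, and loss weighting are illustrative assumptions, not the paper's exact design.

```python
# Minimal PyTorch sketch of the multi-task structure described in the abstract.
# Layer sizes, the BiLSTM encoder, and the loss weighting are illustrative
# assumptions; the authors' implementation is at the repository linked above.
import torch
import torch.nn as nn

class StrengthNetSketch(nn.Module):
    def __init__(self, n_mels=80, hidden=128, n_emotions=5):
        super().__init__()
        # Acoustic encoder: maps a mel-spectrogram to frame-level features.
        self.encoder = nn.LSTM(n_mels, hidden, batch_first=True,
                               bidirectional=True)
        # Strength predictor: regresses a scalar emotion strength score.
        self.strength_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())
        # Auxiliary emotion predictor: classifies the emotion category.
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        feats, _ = self.encoder(mel)
        pooled = feats.mean(dim=1)               # utterance-level average pooling
        return self.strength_head(pooled).squeeze(-1), self.emotion_head(pooled)

# Multi-task objective: strength regression plus auxiliary emotion classification.
def multitask_loss(pred_strength, pred_emotion, gt_strength, gt_emotion, alpha=0.5):
    return (nn.functional.mse_loss(pred_strength, gt_strength)
            + alpha * nn.functional.cross_entropy(pred_emotion, gt_emotion))
```

The auxiliary emotion classification term is what makes the setup multi-task; with the strength head alone it would reduce to plain score regression.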
Related papers
- AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect
Transfer for Speech Synthesis [13.918119853846838]
Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations.
We propose AffectEcho, an emotion translation model that uses a Vector Quantized codebook to model emotions within a quantized space.
We demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker.
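As a rough illustration of the quantized-space idea mentioned above, here is a minimal VQ-style sketch in which a continuous emotion embedding is snapped to its nearest entry in a learned codebook. The codebook size, embedding dimension, and straight-through gradient trick are assumptions for illustration, not AffectEcho's actual design.

```python
# Minimal sketch of vector-quantizing an emotion embedding with a learned
# codebook (VQ-VAE style). Codebook size, embedding dimension, and the
# straight-through gradient trick are illustrative assumptions only.
import torch
import torch.nn as nn

class EmotionCodebook(nn.Module):
    def __init__(self, num_codes=64, dim=128):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)   # learned emotion codebook

    def forward(self, emotion_emb):                 # emotion_emb: (batch, dim)
        # Pick the nearest codebook entry for each continuous emotion embedding.
        dists = torch.cdist(emotion_emb, self.codes.weight)  # (batch, num_codes)
        idx = dists.argmin(dim=-1)
        quantized = self.codes(idx)
        # Straight-through estimator so gradients still reach the encoder.
        quantized = emotion_emb + (quantized - emotion_emb).detach()
        return quantized, idx
```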
arXiv Detail & Related papers (2023-08-16T06:28:29Z)
- Describing emotions with acoustic property prompts for speech emotion recognition [30.990720176317463]
We devise a method to automatically create a description for a given audio clip by computing acoustic properties such as pitch, loudness, speech rate, and articulation rate.
We train a neural network model on these audio-text pairs and evaluate it on an additional dataset.
We investigate how the model learns to associate the audio with the descriptions, resulting in improved performance on Speech Emotion Recognition and Speech Audio Retrieval.
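A minimal sketch of the acoustic-property-to-description idea, assuming librosa for feature extraction; the thresholds and wording are illustrative, and speech rate and articulation rate are omitted because they typically require a transcript or syllable detection.

```python
# Minimal sketch: compute simple acoustic properties with librosa and turn them
# into a text description. Thresholds and wording are illustrative assumptions;
# the paper additionally uses speech rate and articulation rate, which need a
# transcript or syllable detector and are omitted here.
import librosa
import numpy as np

def describe_audio(path):
    y, sr = librosa.load(path, sr=16000)
    # Fundamental frequency (pitch) via probabilistic YIN.
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
    mean_f0 = np.nanmean(f0)
    # Loudness proxy: mean RMS energy in dB.
    rms_db = librosa.amplitude_to_db(librosa.feature.rms(y=y)).mean()
    pitch_word = "high-pitched" if mean_f0 > 200 else "low-pitched"
    loud_word = "loud" if rms_db > -20 else "soft"
    return f"A {pitch_word}, {loud_word} voice."

# Example: pair the generated description with the audio file to form the
# audio-text training pairs mentioned in the summary.
# print(describe_audio("sample.wav"))
```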
arXiv Detail & Related papers (2022-11-14T20:29:37Z)
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle speaker style from linguistic content and encode it into a style embedding in a continuous space that forms the prototype of the emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- FSER: Deep Convolutional Neural Networks for Speech Emotion Recognition [0.015863809575305417]
We introduce FSER, a speech emotion recognition model trained on four valid speech databases.
On each benchmark dataset, FSER outperforms the best models introduced so far, achieving state-of-the-art performance.
FSER could potentially be used to improve mental and emotional health care.
arXiv Detail & Related papers (2021-09-15T05:03:24Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that require additional reference audio as input, our model can predict emotion labels directly from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset with an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
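A minimal sketch of the general reinforcement-learning idea: reward the synthesizer with an emotion recognizer's confidence on the generated speech. The REINFORCE-style surrogate loss and the hypothetical tts_model.sample interface are assumptions, not the paper's exact i-ETTS formulation.

```python
# Minimal sketch: fine-tune a TTS model with a REINFORCE-style update whose
# reward is an emotion recognizer's confidence on the synthesized speech.
# The reward definition, sampling interface, and update rule are illustrative
# assumptions, not the paper's exact formulation.
import torch

def rl_step(tts_model, emotion_recognizer, text, target_emotion, optimizer):
    # Sample speech and keep the log-probability of the sampled outputs
    # (assumes the TTS model exposes a stochastic sampling interface).
    audio, log_prob = tts_model.sample(text)
    with torch.no_grad():
        probs = emotion_recognizer(audio)       # (n_emotions,) softmax output
        reward = probs[target_emotion]          # confidence in the target emotion
    loss = -(reward * log_prob)                 # policy-gradient surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()
```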
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion Recognition [2.1485350418225244]
Spontaneous multi-modal emotion recognition has been extensively studied for human behavior analysis.
We propose a new deep learning-based approach for audio-visual emotion recognition.
arXiv Detail & Related papers (2021-03-16T15:49:15Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.