EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional
Text-to-Speech Model
- URL: http://arxiv.org/abs/2106.09317v1
- Date: Thu, 17 Jun 2021 08:34:21 GMT
- Title: EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional
Text-to-Speech Model
- Authors: Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming
Lei, Zhou Zhao
- Abstract summary: We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset on an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
- Score: 56.75775793011719
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, there has been increasing interest in neural speech synthesis.
While deep neural networks achieve state-of-the-art results on text-to-speech
(TTS) tasks, generating more emotional and more expressive speech has become a
new challenge for researchers due to the scarcity of high-quality emotional
speech datasets and the lack of advanced emotional TTS models. In this paper,
we first briefly introduce and publicly release a Mandarin emotion speech
dataset comprising 9,724 samples with audio files and human-labeled emotion
annotations. After that, we propose a simple but efficient architecture for
emotional speech synthesis called EMSpeech. Unlike models that need additional
reference audio as input, our model can predict emotion labels from the input
text alone and generate more expressive speech conditioned on the emotion
embedding. In the experiment phase, we first validate the effectiveness of our
dataset on an emotion classification task. Then we train our model on the
proposed dataset and conduct a series of subjective evaluations. Finally, by
showing comparable performance on the emotional speech synthesis task, we
demonstrate the ability of the proposed model.
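The abstract's key architectural idea is that an emotion label is predicted from the text itself and mapped to an embedding that conditions the synthesizer. Below is a minimal PyTorch sketch of that idea; the module names, the assumed five-class emotion set, the layer sizes, and the pooling choice are all illustrative assumptions, not the authors' released EMSpeech code.

```python
# Sketch: predict an emotion label from text alone, then condition synthesis
# on the looked-up emotion embedding. Everything here is a toy stand-in.
import torch
import torch.nn as nn


class EMSpeechSketch(nn.Module):
    def __init__(self, vocab_size: int, num_emotions: int = 5, d_model: int = 256):
        super().__init__()
        self.phoneme_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Emotion predictor: pooled text features -> emotion logits
        # (trained with cross-entropy against the dataset's labels).
        self.emotion_classifier = nn.Linear(d_model, num_emotions)
        # One learned embedding per emotion class, added to the encoder output.
        self.emotion_emb = nn.Embedding(num_emotions, d_model)
        # Stand-in for a FastSpeech-style mel decoder (80 mel bins assumed).
        self.mel_decoder = nn.Linear(d_model, 80)

    def forward(self, phonemes: torch.Tensor):
        h = self.text_encoder(self.phoneme_emb(phonemes))   # (B, T, d)
        logits = self.emotion_classifier(h.mean(dim=1))     # (B, num_emotions)
        emotion = logits.argmax(dim=-1)                     # predicted label
        h = h + self.emotion_emb(emotion).unsqueeze(1)      # condition on emotion
        return self.mel_decoder(h), logits                  # coarse mel frames


model = EMSpeechSketch(vocab_size=100)
mel, emotion_logits = model(torch.randint(0, 100, (2, 16)))
```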
Related papers
- Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions [37.075331767703986]
Current emotional text-to-speech systems face challenges in mimicking a broad spectrum of human emotions.
This paper proposes a TTS framework that facilitates control over pleasure, arousal, and dominance.
It can synthesize a diverse range of emotional styles without requiring any emotional speech data during TTS training (a toy conditioning sketch follows below).
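As a rough illustration of dimensional control, here is one common way to inject a continuous pleasure/arousal/dominance (PAD) vector into a TTS hidden state. This is a generic sketch under my own assumptions (a learned linear projection, values in [0, 1]), not the paper's method.

```python
# Sketch: project a 3-D PAD control vector into the model's hidden space
# and add it to the encoder states. Sizes are illustrative assumptions.
import torch
import torch.nn as nn


class PADConditioner(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(3, d_model)  # pleasure, arousal, dominance -> d_model

    def forward(self, hidden: torch.Tensor, pad: torch.Tensor):
        # hidden: (B, T, d_model); pad: (B, 3), each dimension assumed in [0, 1].
        return hidden + self.proj(pad).unsqueeze(1)


cond = PADConditioner()
out = cond(torch.randn(2, 10, 256), torch.tensor([[0.9, 0.7, 0.4], [0.1, 0.2, 0.8]]))
```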
arXiv Detail & Related papers (2024-09-25T07:16:16Z)
- BLSP-Emo: Towards Empathetic Large Speech-Language Models [34.62210186235263]
We present BLSP-Emo, a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech.
Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses.
arXiv Detail & Related papers (2024-06-06T09:02:31Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech [0.0]
State-of-the-art speech models try to get as close as possible to the human voice.
Modelling emotions is an essential part of Text-To-Speech (TTS) research.
EmoSpeech surpasses existing models regarding both MOS score and emotion recognition accuracy in generated speech.
arXiv Detail & Related papers (2023-06-28T19:34:16Z)
- Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech [1.4986031916712106]
Speech emotional representations play a key role in Speech Emotion Recognition (SER) and Emotional Text-To-Speech (TTS) tasks.
Models might overfit to the majority Neutral class and fail to produce robust and effective emotional representations.
We use augmentation approaches to train the model and enable it to extract effective and generalizable emotional representations from imbalanced datasets (a generic balancing sketch follows below).
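One standard recipe for the imbalance problem this entry describes is to oversample minority emotion classes and apply light waveform augmentation. The sketch below shows that generic recipe, not the authors' exact method; the data shapes and noise level are assumptions.

```python
# Sketch: inverse-frequency oversampling plus simple noise augmentation
# for imbalanced emotional speech data. All data here is synthetic.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy data: 1-second 16 kHz waveforms; label 0 = majority Neutral class.
waves = torch.randn(100, 16000)
labels = torch.cat([torch.zeros(80), torch.ones(10), 2 * torch.ones(10)]).long()

# Weight each sample by the inverse frequency of its class.
class_counts = torch.bincount(labels).float()
weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)


def augment(batch_waves: torch.Tensor) -> torch.Tensor:
    # Additive Gaussian noise; real systems also use pitch/tempo perturbation.
    return batch_waves + 0.005 * torch.randn_like(batch_waves)


loader = DataLoader(TensorDataset(waves, labels), batch_size=8, sampler=sampler)
for x, y in loader:
    x = augment(x)  # feed (x, y) to the SER / TTS representation model
    break
```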
arXiv Detail & Related papers (2023-06-09T07:04:56Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder (see the pipeline skeleton below).
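The summary describes a clear four-stage pipeline: extract discrete content units, translate them toward the target emotion, predict prosody from the translated units, and vocode. Here is a skeleton of that flow with toy stand-ins for each stage; the real system uses learned discrete units (e.g., from a self-supervised quantizer), a trained translation model, an F0 predictor, and a neural vocoder, none of which is implemented here.

```python
# Skeleton of the decompose -> translate -> prosody -> vocode pipeline.
# Every function body is an assumed mock-up, not the paper's models.
import torch


def extract_units(wave: torch.Tensor) -> torch.Tensor:
    # Stand-in for a self-supervised quantizer producing discrete content units.
    return torch.randint(0, 100, (wave.shape[0] // 320,))


def translate_units(units: torch.Tensor, target_emotion: int) -> torch.Tensor:
    # Stand-in for the seq2seq model mapping units to the target emotion.
    return units  # identity here; the real model rewrites the unit sequence


def predict_f0(units: torch.Tensor) -> torch.Tensor:
    # Stand-in for the prosody (F0) predictor conditioned on the units.
    return 120.0 + 10.0 * torch.randn(units.shape[0])


def vocode(units: torch.Tensor, f0: torch.Tensor) -> torch.Tensor:
    # Stand-in for a neural vocoder rendering the final waveform.
    return torch.zeros(units.shape[0] * 320)


wave = torch.randn(16000)
units = translate_units(extract_units(wave), target_emotion=2)
audio = vocode(units, predict_f0(units))
```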
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization (a toy policy-gradient sketch follows below).
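A common way to realize this kind of training is to score synthesized speech with a speech emotion recognizer (SER) and use the target-emotion probability as a REINFORCE-style reward. The toy sketch below shows that pattern; the linear modules and feature sizes are stand-ins, not the i-ETTS implementation.

```python
# Sketch: policy-gradient update where an SER's target-emotion probability
# serves as the reward. All modules are toy stand-ins.
import torch
import torch.nn as nn

tts_policy = nn.Linear(8, 4)   # stand-in: text features -> 4 rendering "styles"
ser = nn.Linear(16, 4)         # stand-in SER: audio features -> emotion logits
optimizer = torch.optim.Adam(tts_policy.parameters(), lr=1e-3)

text_feat, target_emotion = torch.randn(8), 2
dist = torch.distributions.Categorical(logits=tts_policy(text_feat))
action = dist.sample()                          # sampled rendering choice
audio_feat = torch.randn(16)                    # pretend synthesis happened
reward = ser(audio_feat).softmax(-1)[target_emotion].detach()
loss = -dist.log_prob(action) * reward          # policy gradient
optimizer.zero_grad(); loss.backward(); optimizer.step()
```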
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
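As a rough illustration of the two-stage strategy in the last entry: pretrain a sequence-to-sequence model on plentiful (typically neutral) speech, then adapt it on the limited emotional data, usually with a smaller learning rate. Everything below, including the GRU stand-in, the feature shapes, and the learning rates, is an assumption for illustration, not the paper's implementation.

```python
# Sketch: two-stage training. Stage 1 learns the general linguistic/acoustic
# mapping; stage 2 adapts emotion from scarce data. Data is synthetic.
import torch
import torch.nn as nn

model = nn.GRU(input_size=80, hidden_size=128, batch_first=True)  # toy seq2seq
mse = nn.MSELoss()


def run_stage(data, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for src, tgt in data:
        out, _ = model(src)
        loss = mse(out[..., :80], tgt)  # compare first 80 dims to mel target
        opt.zero_grad(); loss.backward(); opt.step()


neutral = [(torch.randn(4, 50, 80), torch.randn(4, 50, 80)) for _ in range(10)]
emotional = [(torch.randn(4, 50, 80), torch.randn(4, 50, 80)) for _ in range(2)]
run_stage(neutral, lr=1e-3)    # stage 1: abundant neutral speech
run_stage(emotional, lr=1e-4)  # stage 2: limited emotional speech
```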