ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech
Synthesis with Diffusion and Style-based Models
- URL: http://arxiv.org/abs/2305.13831v1
- Date: Tue, 23 May 2023 08:52:00 GMT
- Authors: Minki Kang, Wooseok Han, Sung Ju Hwang, Eunho Yang
- Abstract summary: ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
- Score: 83.07390037152963
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emotional Text-To-Speech (TTS) is an important task in the development of
systems (e.g., human-like dialogue agents) that require natural and emotional
speech. Existing approaches, however, only aim to produce emotional TTS for
seen speakers during training, without consideration of the generalization to
unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive
emotion-controllable TTS model that allows users to synthesize any speaker's
emotional speech using only a short, neutral speech segment and the target
emotion label. Specifically, to enable a zero-shot adaptive TTS model to
synthesize emotional speech, we propose domain adversarial learning and
guidance methods on the diffusion model. Experimental results demonstrate that
ZET-Speech successfully synthesizes natural and emotional speech with the
desired emotion for both seen and unseen speakers. Samples are at
https://ZET-Speech.github.io/ZET-Speech-Demo/.
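The abstract names two generic ingredients, domain adversarial learning and a guidance method on the diffusion model. Below is a minimal PyTorch sketch of what these two techniques typically look like, offered only as illustration: the names (GradReverse, adversarial_logits, classifier_guided_score, score_model, emotion_clf) and shapes are assumptions, not the authors' actual code or the exact ZET-Speech formulation.

```python
# Hedged sketch, not the ZET-Speech implementation:
# (1) a gradient reversal layer, the standard building block of domain
#     adversarial learning (training an encoder to fool an attached classifier), and
# (2) classifier guidance, which nudges a reverse-diffusion step toward a
#     target emotion label via the classifier's gradient.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated, scaled gradient on the backward
    pass, so the upstream encoder learns features that confuse the classifier."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def adversarial_logits(features: torch.Tensor, classifier: nn.Module,
                       lambd: float = 1.0) -> torch.Tensor:
    # Gradients from `classifier` are reversed before reaching the encoder
    # that produced `features`, pushing it toward domain-invariant features.
    return classifier(GradReverse.apply(features, lambd))


def classifier_guided_score(x_t: torch.Tensor, t: torch.Tensor, emotion: int,
                            score_model: nn.Module, emotion_clf: nn.Module,
                            scale: float = 1.0) -> torch.Tensor:
    """One guided reverse-diffusion step: add the gradient of the emotion
    classifier's log-probability to the base score estimate."""
    base_score = score_model(x_t, t)
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        log_p = emotion_clf(x, t).log_softmax(dim=-1)[:, emotion].sum()
        grad = torch.autograd.grad(log_p, x)[0]  # grad of log p(emotion | x_t)
    return base_score + scale * grad
```

In this kind of setup, guidance is applied only at sampling time, so the same trained diffusion model can be steered toward different emotion labels without retraining; how ZET-Speech combines this with its adversarial training is detailed in the paper itself.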
Related papers
- Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions [37.075331767703986]
Current emotional text-to-speech systems face challenges in mimicking a broad spectrum of human emotions.
This paper proposes a TTS framework that facilitates control over pleasure, arousal, and dominance.
It can synthesize a diversity of emotional styles without requiring any emotional speech data during TTS training.
arXiv Detail & Related papers (2024-09-25T07:16:16Z)
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving the non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech [0.0]
State-of-the-art speech models try to get as close as possible to the human voice.
Modelling emotions is an essential part of Text-To-Speech (TTS) research.
EmoSpeech surpasses existing models in both MOS and emotion recognition accuracy of the generated speech.
arXiv Detail & Related papers (2023-06-28T19:34:16Z)
- Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech [1.4986031916712106]
Speech emotional representations play a key role in Speech Emotion Recognition (SER) and Emotional Text-To-Speech (TTS) tasks.
Models might overfit to the majority Neutral class and fail to produce robust and effective emotional representations.
We use augmentation approaches to train the model and enable it to extract effective and generalizable emotional representations from imbalanced datasets.
arXiv Detail & Related papers (2023-06-09T07:04:56Z)
- Emotional Prosody Control for Speech Generation [7.66200737962746]
We propose a text-to-speech (TTS) system where a user can choose the emotion of generated speech from a continuous and meaningful emotion space.
The proposed TTS system can generate speech from the text in any speaker's style, with fine control of emotion.
arXiv Detail & Related papers (2021-11-07T08:52:04Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotional speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset on an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.