PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control
- URL: http://arxiv.org/abs/2501.06276v1
- Date: Fri, 10 Jan 2025 12:10:30 GMT
- Title: PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control
- Authors: Shaozuo Zhang, Ambuj Mehrish, Yingting Li, Soujanya Poria
- Abstract summary: We introduce an approach centered on prompt-based emotion control.
The proposed architecture incorporates emotion and intensity control across multiple speakers.
We leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content.
- Score: 20.873353104077857
- License:
- Abstract: Speech synthesis has significantly advanced from statistical methods to deep neural network architectures, leading to various text-to-speech (TTS) models that closely mimic human speech patterns. However, capturing nuances such as emotion and style in speech synthesis is challenging. To address this challenge, we introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multiple speakers. Furthermore, we leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content. By embedding emotional cues, regulating intensity levels, and guiding prosodic variations with prompts, our approach infuses synthesized speech with human-like expressiveness and variability. Lastly, we demonstrate the effectiveness of our approach through a systematic exploration of the control mechanisms mentioned above.
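As a rough illustration of the control scheme described in the abstract (categorical emotion cues, a continuous intensity level, and speaker identity steering a shared acoustic model), the sketch below fuses the three signals into one conditioning vector that is added to the phoneme encoder states. Module names, dimensions, and the additive fusion are assumptions made for this sketch, not the PROEMO implementation.

```python
# Illustrative sketch only: fuse emotion, intensity, and speaker conditions
# into a single vector and add it to TTS encoder states. All sizes and the
# additive fusion are assumptions, not the PROEMO architecture.
import torch
import torch.nn as nn

class EmotionIntensityConditioner(nn.Module):
    def __init__(self, n_emotions=5, n_speakers=10, d_model=256):
        super().__init__()
        self.emotion_emb = nn.Embedding(n_emotions, d_model)  # categorical emotion cue
        self.speaker_emb = nn.Embedding(n_speakers, d_model)  # multi-speaker support
        self.intensity_proj = nn.Linear(1, d_model)           # scalar intensity in [0, 1]
        self.fuse = nn.Linear(3 * d_model, d_model)

    def forward(self, enc_out, emotion_id, speaker_id, intensity):
        # enc_out: (batch, time, d_model) phoneme encoder states
        cond = torch.cat(
            [self.emotion_emb(emotion_id),
             self.speaker_emb(speaker_id),
             self.intensity_proj(intensity.unsqueeze(-1))], dim=-1)
        cond = self.fuse(cond).unsqueeze(1)   # (batch, 1, d_model)
        return enc_out + cond                 # broadcast over all time steps

# Toy usage with random features
conditioner = EmotionIntensityConditioner()
enc = torch.randn(2, 50, 256)
out = conditioner(enc, torch.tensor([1, 3]), torch.tensor([0, 7]), torch.tensor([0.2, 0.9]))
print(out.shape)  # torch.Size([2, 50, 256])
```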
Related papers
- EmoSpeech: A Corpus of Emotionally Rich and Contextually Detailed Speech Annotations [1.9827837167752067]
The development of text-to-speech (TTS) systems capable of controlling subtle emotional differences remains a formidable challenge.
Existing emotional speech databases often suffer from overly simplistic labelling schemes that fail to capture a wide range of emotional states.
We propose a novel process aimed at building databases by systematically extracting emotion-rich speech segments and annotating them with detailed natural language descriptions.
arXiv Detail & Related papers (2024-12-09T15:36:37Z)
- EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control [7.596581158724187]
EmoKnob is a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion.
We show that our emotion control framework effectively embeds emotions into speech and surpasses emotion expressiveness of commercial TTS services.
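The summary does not spell out the control mechanism, but one common way to realise a few-shot emotion "knob" is to estimate an emotion direction from a handful of demonstrative embeddings and apply it to the cloned speaker embedding with a tunable strength. The sketch below shows that general idea under those assumptions; it is not EmoKnob's exact procedure.

```python
# Hedged sketch of knob-style few-shot control: an emotion direction is
# estimated from a few demonstrative embeddings and applied with a strength
# knob. Function names and the averaging recipe are illustrative assumptions.
import torch

def emotion_direction(emotional_embs: torch.Tensor, neutral_embs: torch.Tensor) -> torch.Tensor:
    """Few-shot estimate: mean emotional embedding minus mean neutral embedding."""
    return emotional_embs.mean(dim=0) - neutral_embs.mean(dim=0)

def apply_knob(speaker_emb: torch.Tensor, direction: torch.Tensor, strength: float) -> torch.Tensor:
    """Shift the cloned speaker embedding along the emotion direction."""
    return speaker_emb + strength * direction

direction = emotion_direction(torch.randn(5, 64), torch.randn(5, 64))  # 5 demonstrative pairs
conditioned = apply_knob(torch.randn(64), direction, strength=0.7)
print(conditioned.shape)  # torch.Size([64])
```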
arXiv Detail & Related papers (2024-10-01T01:29:54Z)
- Controlling Emotion in Text-to-Speech with Natural Language Prompts [29.013577423045255]
We propose a system conditioned on embeddings derived from an emotionally rich text that serves as the prompt.
A joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture.
Our approach is trained on merged emotional speech and text datasets and varies the prompt for each training iteration to increase the generalization capabilities of the model.
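A minimal sketch of the conditioning pattern summarised above, assuming the prompt has already been reduced to a sentence embedding: a joint speaker-and-prompt vector is re-injected before every layer of a transformer encoder. The dimensions, layer count, and injection points are assumptions for illustration, not the authors' configuration.

```python
# Illustrative sketch: inject a joint speaker + prompt embedding at several
# points of a transformer stack. Sizes and injection points are assumptions.
import torch
import torch.nn as nn

class PromptConditionedEncoder(nn.Module):
    def __init__(self, d_model=256, d_prompt=384, n_speakers=10, n_layers=4):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, d_model)
        self.prompt_proj = nn.Linear(d_prompt, d_model)  # prompt sentence embedding -> model dim
        self.joint = nn.Linear(2 * d_model, d_model)     # joint speaker/prompt representation
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers))

    def forward(self, x, prompt_emb, speaker_id):
        # x: (batch, time, d_model) phoneme features
        cond = self.joint(torch.cat(
            [self.prompt_proj(prompt_emb), self.speaker_emb(speaker_id)], dim=-1))
        for layer in self.layers:
            x = layer(x + cond.unsqueeze(1))  # re-inject the condition before every layer
        return x

encoder = PromptConditionedEncoder()
phonemes = torch.randn(2, 40, 256)
prompt = torch.randn(2, 384)  # e.g. from any off-the-shelf sentence encoder
print(encoder(phonemes, prompt, torch.tensor([0, 3])).shape)  # torch.Size([2, 40, 256])
```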
arXiv Detail & Related papers (2024-06-10T15:58:42Z)
- UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts [64.02363948840333]
UMETTS is a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information.
EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
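The cross-modal alignment attributed to EP-Align can be sketched with a standard InfoNCE-style contrastive objective that pulls matching emotion embeddings together and pushes mismatched pairs apart. The version below covers only two modalities and is a generic illustration of the technique, not the UMETTS code.

```python
# Generic InfoNCE-style contrastive alignment between two modalities.
# Temperature and batch-wise negatives are standard choices, not UMETTS's.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, audio_emb, temperature=0.07):
    # text_emb, audio_emb: (batch, dim); row i of each describes the same utterance
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(text_emb.size(0))         # diagonal entries are positive pairs
    # symmetric loss: text -> audio and audio -> text directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

print(contrastive_alignment_loss(torch.randn(8, 128), torch.randn(8, 128)).item())
```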
arXiv Detail & Related papers (2024-04-29T03:19:39Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations [16.524515747017787]
We propose a novel method to control the continuous intensity of emotions using semi-supervised learning.
The experimental results showed that the proposed method was superior in controllability and naturalness.
arXiv Detail & Related papers (2022-11-11T12:28:07Z)
- Emotional Prosody Control for Speech Generation [7.66200737962746]
We propose a text-to-speech (TTS) system where a user can choose the emotion of the generated speech from a continuous and meaningful emotion space.
The proposed TTS system can generate speech from the text in any speaker's style, with fine control of emotion.
arXiv Detail & Related papers (2021-11-07T08:52:04Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels directly from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
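A minimal sketch of the text-only conditioning idea described above: an emotion label is predicted from the input text and its embedding conditions generation, with no reference audio required. The tiny classifier and dimensions are placeholders, not the released model.

```python
# Placeholder sketch: predict an emotion label from pooled text features and
# return its embedding as a conditioning signal. Sizes are assumptions.
import torch
import torch.nn as nn

class TextEmotionPredictor(nn.Module):
    def __init__(self, d_text=256, n_emotions=5):
        super().__init__()
        self.classifier = nn.Linear(d_text, n_emotions)
        self.emotion_emb = nn.Embedding(n_emotions, d_text)

    def forward(self, text_feat):
        # text_feat: (batch, d_text) pooled representation of the input text
        emotion_id = self.classifier(text_feat).argmax(dim=-1)  # label from text alone
        return self.emotion_emb(emotion_id), emotion_id         # embedding conditions the decoder

predictor = TextEmotionPredictor()
emb, label = predictor(torch.randn(4, 256))
print(emb.shape, label)  # torch.Size([4, 256]) and four predicted labels
```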
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
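As a generic illustration of reward-driven training, the sketch below samples discrete acoustic tokens from a toy policy, scores them with a stand-in emotion recogniser, and applies a REINFORCE update. It shows only the general reinforcement-learning mechanic; the token vocabulary, reward model, and single-step setup are placeholders, not the i-ETTS recipe.

```python
# Toy REINFORCE step: an emotion recogniser's confidence in the target emotion
# is used as the reward for sampled acoustic tokens. Everything here is a
# stand-in for illustration, not the i-ETTS training procedure.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_text, n_emotions = 64, 32, 4
policy = nn.Linear(d_text, vocab)            # stand-in TTS "policy" over acoustic tokens
reward_model = nn.Linear(vocab, n_emotions)  # stand-in frozen emotion recogniser
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

text_feat = torch.randn(16, d_text)
target_emotion = torch.randint(0, n_emotions, (16,))

logits = policy(text_feat)
dist = torch.distributions.Categorical(logits=logits)
tokens = dist.sample()  # (batch,) sampled acoustic tokens
with torch.no_grad():   # the recogniser only provides a scalar reward
    probs = F.softmax(reward_model(F.one_hot(tokens, vocab).float()), dim=-1)
    reward = probs.gather(1, target_emotion.unsqueeze(1)).squeeze(1)
loss = -(reward * dist.log_prob(tokens)).mean()  # REINFORCE objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"mean reward: {reward.mean().item():.3f}")
```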
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.