PromptTTS: Controllable Text-to-Speech with Text Descriptions
- URL: http://arxiv.org/abs/2211.12171v1
- Date: Tue, 22 Nov 2022 10:58:38 GMT
- Title: PromptTTS: Controllable Text-to-Speech with Text Descriptions
- Authors: Zhifang Guo, Yichong Leng, Yihan Wu, Sheng Zhao, Xu Tan
- Abstract summary: We develop a text-to-speech (TTS) system that takes a prompt with both style and content descriptions as input to synthesize the corresponding speech.
PromptTTS consists of a style encoder and a content encoder to extract the corresponding representations from the prompt.
Experiments show that PromptTTS can generate speech with precise style control and high speech quality.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Using a text description as a prompt to guide the generation of text or images
(e.g., GPT-3 or DALLE-2) has drawn wide attention recently. Beyond text and
image generation, in this work, we explore the possibility of utilizing text
descriptions to guide speech synthesis. Thus, we develop a text-to-speech (TTS)
system (dubbed PromptTTS) that takes a prompt with both style and content
descriptions as input to synthesize the corresponding speech. Specifically,
PromptTTS consists of a style encoder and a content encoder to extract the
corresponding representations from the prompt, and a speech decoder to
synthesize speech according to the extracted style and content representations.
Compared with previous works in controllable TTS that require users to have
acoustic knowledge to understand style factors such as prosody and pitch,
PromptTTS is more user-friendly since text descriptions are a more natural way
to express speech style (e.g., "A lady whispers to her friend slowly"). Given
that there is no TTS dataset with prompts, to benchmark the task of PromptTTS,
we construct and release a dataset containing prompts with style and content
information and the corresponding speech. Experiments show that PromptTTS can
generate speech with precise style control and high speech quality. Audio
samples and our dataset are publicly available.
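As a rough sketch of the pipeline the abstract describes (a style encoder, a content encoder, and a speech decoder), the PyTorch code below wires the three components together. All dimensions, layer counts, the pooling scheme, and the mel-spectrogram target are illustrative assumptions; the paper's actual model, including details such as duration modeling, is not reproduced here.

```python
# Minimal sketch of the PromptTTS layout described in the abstract:
# a style encoder and a content encoder read the prompt, and a speech
# decoder generates a mel-spectrogram from both representations.
# Sizes and blocks are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    """Maps the style-description tokens to a single style vector."""
    def __init__(self, vocab_size=30522, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, style_tokens):          # (B, T_style)
        h = self.encoder(self.embed(style_tokens))
        return h.mean(dim=1)                  # (B, d_model) pooled style vector


class ContentEncoder(nn.Module):
    """Encodes the content (the text to be spoken) into per-token features."""
    def __init__(self, vocab_size=30522, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, content_tokens):        # (B, T_content)
        return self.encoder(self.embed(content_tokens))


class SpeechDecoder(nn.Module):
    """Decodes style-conditioned content features into a mel-spectrogram.

    A real TTS decoder would also model duration/length; omitted here.
    """
    def __init__(self, d_model=256, n_mels=80):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mel_proj = nn.Linear(d_model, n_mels)

    def forward(self, content_feats, style_vec):
        # Broadcast the style vector over every content position.
        h = content_feats + style_vec.unsqueeze(1)
        return self.mel_proj(self.decoder(h))  # (B, T_content, n_mels)


class PromptTTSSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.style_encoder = StyleEncoder()
        self.content_encoder = ContentEncoder()
        self.speech_decoder = SpeechDecoder()

    def forward(self, style_tokens, content_tokens):
        style_vec = self.style_encoder(style_tokens)
        content_feats = self.content_encoder(content_tokens)
        return self.speech_decoder(content_feats, style_vec)
```

In use, a tokenized style description such as "A lady whispers to her friend slowly" would feed the style encoder, the sentence to be spoken would feed the content encoder, and a separate vocoder would turn the predicted mel-spectrogram into a waveform.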
Related papers
- StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations [12.891344121936902]
We introduce StoryTTS, a highly expressive TTS (ETTS) dataset that contains rich expressiveness from both the acoustic and the textual perspective.
We analyze and define speech-related textual expressiveness in StoryTTS along five distinct dimensions, drawing on linguistics, rhetoric, etc.
The resulting corpus contains 61 hours of consecutive and highly prosodic speech equipped with accurate text transcriptions and rich textual expressiveness annotations.
arXiv Detail & Related papers (2024-04-23T11:41:35Z)
- PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions [21.15647416266187]
We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions.
We introduce the concept of speaker prompt, which describes voice characteristics designed to be approximately independent of speaking style.
Our subjective evaluation results show that the proposed method can better control speaker characteristics than the methods without the speaker prompt.
arXiv Detail & Related papers (2023-09-15T04:11:37Z)
- PromptTTS 2: Describing and Generating Voices with Text Prompt [102.93668747303975]
Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information.
Traditional text-to-speech (TTS) methods rely on speech prompts (reference speech) for voice variability.
In this work, we introduce PromptTTS 2, which uses a variation network to provide the voice variability information not captured by text prompts.
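The summary above mentions the variation network only at a high level. As a hypothetical illustration (the module name, sizes, and noise-sampling scheme below are assumptions, not PromptTTS 2's actual design), one can picture it as a network that samples the voice attributes a text prompt leaves unspecified:

```python
# Hypothetical sketch of a "variation network": given a text-prompt
# embedding, sample voice attributes the prompt does not pin down.
# All sizes and the noise-concatenation scheme are illustrative
# assumptions, not the actual PromptTTS 2 implementation.
import torch
import torch.nn as nn

class VariationNetwork(nn.Module):
    def __init__(self, d_prompt: int = 256, d_voice: int = 256):
        super().__init__()
        self.d_voice = d_voice
        self.net = nn.Sequential(
            nn.Linear(d_prompt + d_voice, 512),
            nn.ReLU(),
            nn.Linear(512, d_voice),
        )

    def forward(self, prompt_emb: torch.Tensor) -> torch.Tensor:
        # Fresh noise per call: repeated sampling yields different voices,
        # all consistent with the same text prompt.
        noise = torch.randn(prompt_emb.size(0), self.d_voice)
        return self.net(torch.cat([prompt_emb, noise], dim=-1))
```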
arXiv Detail & Related papers (2023-09-05T14:45:27Z)
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- Contextual Expressive Text-to-Speech [25.050361896378533]
We introduce a new task setting, Contextual Text-to-Speech (CTTS).
The main idea of CTTS is that how a person speaks depends on the particular context she is in, where the context can typically be represented as text.
We construct a synthetic dataset and develop an effective framework to generate high-quality expressive speech based on the given context.
arXiv Detail & Related papers (2022-11-26T12:06:21Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
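The summary above says only that discrete tokenizers bridge the two modalities. One common way to discretize speech into text-like tokens is k-means clustering over frame-level features; the sketch below illustrates that general idea (the function names and the use of k-means are assumptions, not SpeechLM's exact tokenizers):

```python
# Illustrative sketch of discretizing speech so it can be handled like
# text tokens: k-means over frame-level features. SpeechLM's actual
# tokenizers (the paper describes two alternatives) are not reproduced.
import numpy as np
from sklearn.cluster import KMeans

def fit_unit_tokenizer(features: np.ndarray, n_units: int = 100) -> KMeans:
    """features: (num_frames, feat_dim) speech representations."""
    return KMeans(n_clusters=n_units, n_init=10).fit(features)

def tokenize(km: KMeans, utterance_feats: np.ndarray) -> np.ndarray:
    """Maps each frame to a discrete unit id, analogous to a text token id."""
    return km.predict(utterance_feats)
```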
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style [111.89762723159677]
We develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech.
AdaSpeech 3 synthesizes speech with natural filled pauses (FPs) and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
arXiv Detail & Related papers (2021-07-06T10:40:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents (including all listed content) and is not responsible for any consequences of its use.