PromptTTS 2: Describing and Generating Voices with Text Prompt
- URL: http://arxiv.org/abs/2309.02285v2
- Date: Thu, 12 Oct 2023 03:05:36 GMT
- Title: PromptTTS 2: Describing and Generating Voices with Text Prompt
- Authors: Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu,
Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li,
Sheng Zhao, Tao Qin, Jiang Bian
- Abstract summary: Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information.
Traditional text-to-speech (TTS) methods rely on speech prompts (reference speech) for voice variability.
In this work, we introduce PromptTTS 2 to address these challenges with a variation network that provides the voice variability information not captured by text prompts, and a prompt generation pipeline that uses large language models to compose high-quality text prompts.
- Score: 102.93668747303975
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speech conveys more information than text, as the same word can be uttered in
various voices to convey diverse information. Compared to traditional
text-to-speech (TTS) methods relying on speech prompts (reference speech) for
voice variability, using text prompts (descriptions) is more user-friendly
since speech prompts can be hard to find or may not exist at all. TTS
approaches based on the text prompt face two main challenges: 1) the
one-to-many problem, where not all details about voice variability can be
described in the text prompt, and 2) the limited availability of text prompt
datasets, since writing text prompts for speech requires vendors and incurs a
large data-labeling cost. In this work, we introduce PromptTTS 2 to address
these challenges with a variation network to provide variability information of
voice not captured by text prompts, and a prompt generation pipeline that
utilizes large language models (LLMs) to compose high-quality text prompts.
Specifically, the variation network predicts the representation extracted from
the reference speech (which contains full information about voice variability)
based on the text prompt representation. For the prompt generation pipeline, it
generates text prompts for speech with a speech language understanding model to
recognize voice attributes (e.g., gender, speed) from speech and a large
language model to formulate text prompts based on the recognition results.
Experiments on a large-scale (44K hours) speech dataset demonstrate that
compared to the previous works, PromptTTS 2 generates voices more consistent
with text prompts and supports the sampling of diverse voice variability,
thereby offering users more choices on voice generation. Additionally, the
prompt generation pipeline produces high-quality text prompts, eliminating the
large labeling cost. The demo page of PromptTTS 2 is available online.
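The prompt generation pipeline described above can be sketched as two stages: a speech language understanding (SLU) model that recognizes voice attributes from speech, and an LLM that formulates a text prompt from the recognized attributes. The sketch below is a minimal illustration of that data flow only; the function names, the attribute set, and the template-based formulation step are hypothetical stand-ins, not the paper's actual models.

```python
# Hedged sketch of the two-stage prompt generation pipeline.
# All names here are hypothetical; the real system runs an SLU model
# and an LLM, neither of which is reproduced in this illustration.

def recognize_attributes(speech_path: str) -> dict:
    """Stand-in for the SLU model: map a speech clip to voice attributes
    (e.g., gender, speed). A real system would run trained classifiers."""
    return {"gender": "female", "speed": "fast", "pitch": "high"}

def formulate_prompt(attributes: dict) -> str:
    """Stand-in for the LLM step: turn recognized attributes into a
    natural-language description of the voice."""
    parts = [value if key == "gender" else f"{value} {key}"
             for key, value in attributes.items()]
    return "A " + ", ".join(parts) + " voice."

def generate_text_prompt(speech_path: str) -> str:
    """Full pipeline: speech -> recognized attributes -> text prompt."""
    return formulate_prompt(recognize_attributes(speech_path))

print(generate_text_prompt("sample.wav"))
# e.g. "A female, fast speed, high pitch voice."
```

In the paper, the second stage is an LLM prompted with the recognition results rather than a fixed template, which is what lets the pipeline produce varied, natural-sounding descriptions at scale without manual labeling.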
Related papers
- Voice Attribute Editing with Text Prompt [48.48628304530097]
This paper introduces a novel task: voice attribute editing with text prompt.
The goal is to make relative modifications to voice attributes according to the actions described in the text prompt.
To solve this task, VoxEditor, an end-to-end generative model, is proposed.
arXiv Detail & Related papers (2024-04-13T00:07:40Z) - On The Open Prompt Challenge In Conditional Audio Generation [25.178010153697976]
Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text.
We treat TTA models as a "black box" and address the user prompt challenge with two key insights.
We propose utilizing text-audio alignment as feedback signals via margin ranking learning for audio improvements.
arXiv Detail & Related papers (2023-11-01T23:33:25Z) - TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z) - Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z) - PromptTTS: Controllable Text-to-Speech with Text Descriptions [32.647362978555485]
We develop a text-to-speech (TTS) system that takes a prompt with both style and content descriptions as input to synthesize the corresponding speech.
PromptTTS consists of a style encoder and a content encoder to extract the corresponding representations from the prompt.
Experiments show that PromptTTS can generate speech with precise style control and high speech quality.
arXiv Detail & Related papers (2022-11-22T10:58:38Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.