TextrolSpeech: A Text Style Control Speech Corpus With Codec Language
Text-to-Speech Models
- URL: http://arxiv.org/abs/2308.14430v1
- Date: Mon, 28 Aug 2023 09:06:32 GMT
- Authors: Shengpeng Ji, Jialong Zuo, Minghui Fang, Ziyue Jiang, Feiyang Chen,
Xinyu Duan, Baoxing Huai, Zhou Zhao
- Abstract summary: We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been a growing interest in the field of controllable
Text-to-Speech (TTS). While previous studies have relied on users providing
specific style factor values based on acoustic knowledge or selecting reference
speeches that meet certain requirements, generating speech solely from natural
text prompts has emerged as a new challenge for researchers. This challenge
arises due to the scarcity of high-quality speech datasets with natural text
style prompts and the absence of advanced text-controllable TTS models. In light
of this, 1) we propose TextrolSpeech, which is the first large-scale speech
emotion dataset annotated with rich text attributes. The dataset comprises
236,220 pairs of style prompts in natural text descriptions with five style
factors and corresponding speech samples. Through iterative experimentation, we
introduce a multi-stage prompt programming approach that effectively utilizes
the GPT model for generating natural style descriptions in large volumes. 2)
Furthermore, to address the need for generating audio with greater style
diversity, we propose an efficient architecture called Salle. This architecture
treats text-controllable TTS as a language model task, utilizing audio codec
codes as an intermediate representation to replace the conventional
mel-spectrogram. Finally, we demonstrate the capability of the proposed
model by showing comparable performance on the controllable TTS task.
Audio samples are available at https://sall-e.github.io/
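The abstract's core idea is to frame text-controllable TTS as next-token prediction over discrete audio codec codes rather than mel-spectrogram regression. A minimal sketch of how such a system might lay out the style prompt, content, and codec codes as a single language-model sequence is shown below; the token offsets, vocabulary layout, and special tokens are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of how a codec-language-model TTS system (in the
# spirit of Salle) could frame synthesis as next-token prediction.
# All token ids and vocabulary offsets here are illustrative assumptions.

STYLE_VOCAB_OFFSET = 0      # style-prompt text tokens
PHONE_VOCAB_OFFSET = 1000   # phoneme tokens for the content text
CODEC_VOCAB_OFFSET = 2000   # discrete audio-codec codes (one quantizer level)
BOS, SEP, EOS = 9000, 9001, 9002  # assumed special tokens

def build_lm_sequence(style_tokens, phoneme_tokens, codec_tokens):
    """Concatenate style prompt, content, and target codec codes into one
    flat sequence, so a decoder-only LM can be trained with plain
    next-token prediction on the codec segment."""
    return ([BOS]
            + [STYLE_VOCAB_OFFSET + t for t in style_tokens]
            + [SEP]
            + [PHONE_VOCAB_OFFSET + t for t in phoneme_tokens]
            + [SEP]
            + [CODEC_VOCAB_OFFSET + t for t in codec_tokens]
            + [EOS])

seq = build_lm_sequence([3, 7], [12, 5, 9], [101, 42])
```

At inference time, the model would condition on everything up to the second `SEP`, sample the codec segment autoregressively, and pass the resulting codes to the codec's decoder to reconstruct the waveform.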
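The paper also describes a multi-stage prompt programming approach that uses a GPT model to turn discrete style factors into natural-language style descriptions at scale. A rough two-stage sketch of that idea follows; the stage structure, factor names, synonym table, and template wording are all hypothetical illustrations, not the authors' actual pipeline.

```python
# Hypothetical sketch of a multi-stage prompt-programming pipeline for
# generating natural style descriptions from discrete style factors.
# Stage names, factors, and templates below are illustrative assumptions.

def stage1_keywords(factors):
    """Stage 1: expand each discrete style factor into richer keywords,
    so later stages produce varied rather than formulaic descriptions."""
    synonyms = {
        "happy": ["cheerful", "upbeat"],
        "fast": ["rapid", "brisk"],
        "female": ["a woman's"],
    }
    return {name: synonyms.get(value, [value]) for name, value in factors.items()}

def stage2_prompt(keywords):
    """Stage 2: assemble the instruction prompt that would be sent to a
    GPT-style model to obtain a fluent one-sentence style description."""
    flat = "; ".join(f"{name}: {', '.join(words)}" for name, words in keywords.items())
    return ("Rewrite the following speech style factors as one natural "
            f"English sentence describing the voice. Factors: {flat}.")

factors = {"emotion": "happy", "speed": "fast", "gender": "female"}
prompt = stage2_prompt(stage1_keywords(factors))
```

Separating keyword expansion from sentence generation is one plausible way to get "large volumes" of non-repetitive descriptions: the first stage injects lexical diversity, and the language model only has to handle fluency.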
Related papers
- ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec [50.273832905535485]
We present ControlSpeech, a text-to-speech (TTS) system capable of fully mimicking the speaker's voice and enabling arbitrary control and adjustment of speaking style.
Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and adjustment capabilities or were unrelated to speaker-specific voice generation.
arXiv Detail & Related papers (2024-06-03T11:15:16Z)
- BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data [15.447206120523356]
BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data.
We show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences.
arXiv Detail & Related papers (2024-02-12T22:21:30Z)
- Natural language guidance of high-fidelity text-to-speech with synthetic annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- PromptTTS: Controllable Text-to-Speech with Text Descriptions [32.647362978555485]
We develop a text-to-speech (TTS) system that takes a prompt with both style and content descriptions as input to synthesize the corresponding speech.
PromptTTS consists of a style encoder and a content encoder to extract the corresponding representations from the prompt.
Experiments show that PromptTTS can generate speech with precise style control and high speech quality.
arXiv Detail & Related papers (2022-11-22T10:58:38Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.