PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions
- URL: http://arxiv.org/abs/2309.08140v2
- Date: Wed, 27 Dec 2023 10:41:36 GMT
- Title: PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions
- Authors: Reo Shimizu, Ryuichi Yamamoto, Masaya Kawamura, Yuma Shirahata, Hironori Doi, Tatsuya Komatsu, Kentaro Tachibana
- Abstract summary: We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions.
We introduce the concept of speaker prompt, which describes voice characteristics designed to be approximately independent of speaking style.
Our subjective evaluation results show that the proposed method can better control speaker characteristics than the methods without the speaker prompt.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system
that allows control over speaker identity using natural language descriptions.
To control speaker identity within the prompt-based TTS framework, we introduce
the concept of speaker prompt, which describes voice characteristics (e.g.,
gender-neutral, young, old, and muffled) designed to be approximately
independent of speaking style. Since there is no large-scale dataset containing
speaker prompts, we first construct a dataset based on the LibriTTS-R corpus
with manually annotated speaker prompts. We then employ a diffusion-based
acoustic model with mixture density networks to model diverse speaker factors
in the training data. Unlike previous studies that rely on style prompts
describing only a limited aspect of speaker individuality, such as pitch,
speaking speed, and energy, our method utilizes an additional speaker prompt to
effectively learn the mapping from natural language descriptions to the
acoustic features of diverse speakers. Our subjective evaluation results show
that the proposed method can better control speaker characteristics than the
methods without the speaker prompt. Audio samples are available at
https://reppy4620.github.io/demo.promptttspp/.
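The abstract names a diffusion-based acoustic model with mixture density networks (MDNs) but gives no implementation details. Below is a minimal, hypothetical sketch of an MDN head that maps a prompt embedding to a Gaussian mixture over speaker-factor vectors; the dimensions, module names, and single-linear-layer design are illustrative assumptions, not the authors' code.

```python
# Hypothetical MDN head: prompt embedding -> Gaussian mixture over
# speaker-factor vectors. All sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNHead(nn.Module):
    def __init__(self, prompt_dim=512, speaker_dim=256, n_components=8):
        super().__init__()
        self.n = n_components
        self.d = speaker_dim
        # One projection predicts mixture logits, means, and log-stds.
        self.proj = nn.Linear(prompt_dim, n_components * (1 + 2 * speaker_dim))

    def forward(self, prompt_emb):
        out = self.proj(prompt_emb)
        logit_w, mu, log_sigma = torch.split(
            out, [self.n, self.n * self.d, self.n * self.d], dim=-1)
        mu = mu.view(-1, self.n, self.d)
        sigma = log_sigma.view(-1, self.n, self.d).exp()
        return logit_w, mu, sigma

    def nll(self, prompt_emb, speaker_vec):
        """Negative log-likelihood of a target speaker vector under the mixture."""
        logit_w, mu, sigma = self(prompt_emb)
        comp = torch.distributions.Normal(mu, sigma)
        # Sum log-probs over speaker dimensions, then mix over components.
        log_prob = comp.log_prob(speaker_vec.unsqueeze(1)).sum(-1)  # (B, n)
        log_w = F.log_softmax(logit_w, dim=-1)
        return -torch.logsumexp(log_w + log_prob, dim=-1).mean()

head = MDNHead()
loss = head.nll(torch.randn(4, 512), torch.randn(4, 256))
```

Minimizing this negative log-likelihood lets a single prompt map to several plausible voices, which is the usual motivation for a mixture rather than a single Gaussian when modeling diverse speaker factors.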
Related papers
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information and derive pairwise speaker constraints from it.
We present a novel framework to propagate these pairwise constraints through the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
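The entry above mentions integrating pairwise constraints into the diarization pipeline. As a rough, generic illustration (not the paper's actual propagation algorithm), the sketch below folds must-link/cannot-link constraints into a segment-affinity matrix before spectral clustering; the toy embeddings and constraint pairs are placeholders.

```python
# Illustrative only: fold must-link / cannot-link constraints (e.g. derived
# from semantic content) into a segment-affinity matrix before clustering.
# A generic constrained-clustering trick, not the paper's exact algorithm.
import numpy as np
from sklearn.cluster import SpectralClustering

def apply_constraints(affinity, must_link, cannot_link):
    A = affinity.copy()
    for i, j in must_link:    # force high similarity
        A[i, j] = A[j, i] = 1.0
    for i, j in cannot_link:  # force low similarity
        A[i, j] = A[j, i] = 0.0
    return A

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 16))                    # toy segment embeddings
aff = np.clip(emb @ emb.T / 16.0 + 0.5, 0.0, 1.0)  # toy affinity in [0, 1]
np.fill_diagonal(aff, 1.0)
aff = apply_constraints(aff, must_link=[(0, 1)], cannot_link=[(0, 9)])
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(aff)
```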
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z)
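The multi-stage prompt programming mentioned above is only named, not specified. A hedged sketch of the general idea: structured attribute tags are turned into an instruction for a text LLM, which rewrites them as a natural-language style description. The template, attribute names, and staging below are assumptions for illustration, not TextrolSpeech's actual prompts.

```python
# Hedged sketch of "prompt programming" for style-description generation.
# The template and attributes are assumptions, not the paper's pipeline.
ATTRIBUTES = {"gender": "female", "pitch": "high", "speed": "slow", "emotion": "sad"}

def build_instruction(attrs: dict) -> str:
    tags = ", ".join(f"{k}={v}" for k, v in attrs.items())
    return ("Rewrite the following speech attributes as one fluent sentence "
            f"describing the speaking style: {tags}")

print(build_instruction(ATTRIBUTES))
# The instruction would then be sent to an LLM (e.g., a GPT model) in several
# stages -- generation, paraphrasing, filtering -- to collect diverse descriptions.
```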
- Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z)
- Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model [13.572330725278066]
The key novelty of the proposed method is the direct use of a self-supervised learning (SSL) model, trained on a large amount of data, to obtain embedding vectors from speech representations.
The disentangled embeddings enable better reproduction performance for unseen speakers, as well as rhythm transfer conditioned on different utterances.
arXiv Detail & Related papers (2023-04-24T10:15:58Z)
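As a rough illustration of conditioning on SSL-derived embeddings, the sketch below mean-pools frame-level features from a pretrained wav2vec 2.0 model via torchaudio. The specific SSL model, the pooling choice, and the input file name are assumptions, not necessarily what this paper uses.

```python
# Obtain an utterance-level embedding from a pretrained SSL model.
# WAV2VEC2_BASE and mean pooling are stand-ins for the paper's setup.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("reference.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
with torch.inference_mode():
    features, _ = model.extract_features(waveform)  # list of per-layer outputs
utt_embedding = features[-1].mean(dim=1)            # (1, feature_dim)
```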
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
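Discrete speech units of this kind are commonly obtained by quantizing frame-level SSL features with k-means. The sketch below shows that generic recipe with placeholder features and an assumed codebook size; it is not the paper's exact unit extractor.

```python
# Generic recipe for discrete speech units: k-means over SSL frame features,
# so each frame becomes a unit index. Feature source and 100-unit codebook
# are assumptions for this sketch.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frames = rng.normal(size=(5000, 768))  # stand-in for SSL frame features
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)
units = kmeans.predict(frames)         # (5000,) discrete unit IDs
```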
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
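The joint masking idea above can be pictured with a small sketch: random spans of both the spectrogram frames and the phoneme tokens are replaced with mask values, and the model would be trained to reconstruct them. The span length, masking rate, and mask values below are assumptions.

```python
# Minimal sketch of joint span masking over spectrogram frames and phoneme
# tokens; hyperparameters are illustrative assumptions.
import torch

def mask_spans(x, mask_value, span=5, p=0.15):
    """Mask random spans along the time axis of a (T, ...) tensor."""
    x = x.clone()
    t = x.shape[0]
    n_spans = max(1, int(t * p / span))
    for start in torch.randint(0, max(1, t - span), (n_spans,)).tolist():
        x[start:start + span] = mask_value
    return x

spec = torch.randn(200, 80)             # (frames, mel bins)
phonemes = torch.randint(1, 70, (50,))  # phoneme IDs; 0 reserved for mask
masked_spec = mask_spans(spec, mask_value=0.0)
masked_phonemes = mask_spans(phonemes, mask_value=0)
```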
- Self supervised learning for robust voice cloning [3.7989740031754806]
We use features learned in a self-supervised framework to produce high-quality speech representations.
The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture.
This method enables us to train our model on an unlabeled multispeaker dataset and to use unseen speaker embeddings to copy a speaker's voice.
arXiv Detail & Related papers (2022-04-07T13:05:24Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We systematically model speaker characteristics to improve generalization to new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines on multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Cross-speaker style transfer for text-to-speech using data augmentation [11.686745250628247]
We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion.
We assume access to a corpus of neutral, non-expressive data from a target speaker and supporting expressive conversational data from different speakers.
We conclude by scaling our proposed technology to a set of 14 speakers across 7 languages.
arXiv Detail & Related papers (2022-02-10T15:10:56Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
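A common way to realize a discrete speech representation like the one named above is a vector-quantization bottleneck. The sketch below shows a generic VQ layer with a straight-through gradient and illustrative sizes; it is a stand-in for whatever discrete representation the paper actually uses.

```python
# Generic VQ bottleneck: each encoder frame is snapped to its nearest
# codebook entry, with a straight-through gradient. Sizes are illustrative.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=256, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                  # z: (B, T, dim)
        flat = z.reshape(-1, z.shape[-1])
        dists = torch.cdist(flat, self.codebook.weight)    # (B*T, num_codes)
        idx = dists.argmin(dim=-1)
        q = self.codebook(idx).view_as(z)
        # Straight-through estimator: gradients pass through z unchanged.
        q = z + (q - z).detach()
        return q, idx.view(z.shape[:-1])

vq = VectorQuantizer()
quantized, codes = vq(torch.randn(2, 100, 64))
```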
- From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint [11.982748481062542]
This paper presents a multispeaker speech synthesis system with a feedback constraint.
We enhance knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network during training.
The model is trained and evaluated on publicly available datasets.
arXiv Detail & Related papers (2020-05-10T06:11:37Z)
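The feedback constraint above can be pictured as an auxiliary loss: a speaker verification (SV) network embeds both the reference and the synthesized speech, and a cosine-similarity term pulls the synthesized embedding toward the reference. The sketch below uses a dummy SV network and an assumed loss weight; it illustrates the idea, not the paper's exact training objective.

```python
# Illustrative feedback-constraint loss: the SV embedding of synthesized
# speech is pulled toward the reference embedding. Weight and SV model
# are placeholders, not the paper's setup.
import torch
import torch.nn.functional as F

def feedback_loss(sv_model, ref_audio, gen_audio, tts_loss, weight=0.1):
    with torch.no_grad():
        ref_emb = sv_model(ref_audio)  # reference embedding, no gradient
    gen_emb = sv_model(gen_audio)      # gradients flow back to the TTS model
    sim = F.cosine_similarity(ref_emb, gen_emb, dim=-1).mean()
    return tts_loss + weight * (1.0 - sim)

# Dummy SV network standing in for a real, pretrained verification model.
sv_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16000, 192))
loss = feedback_loss(sv_model, torch.randn(2, 16000), torch.randn(2, 16000),
                     tts_loss=torch.tensor(1.0))
```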