Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control
- URL: http://arxiv.org/abs/2409.17452v1
- Date: Thu, 26 Sep 2024 01:08:09 GMT
- Title: Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control
- Authors: Ryuichi Yamamoto, Yuma Shirahata, Masaya Kawamura, Kentaro Tachibana
- Abstract summary: We propose a novel controllable text-to-speech (TTS) method with cross-lingual control capability.
We combine a TTS model trained on the target language with a description control model trained on another language, which maps input text descriptions to the conditional features of the TTS model.
Experiments on English and Japanese TTS demonstrate that our method achieves high naturalness and controllability for both languages.
- Score: 14.145510487599932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel description-based controllable text-to-speech (TTS) method
with cross-lingual control capability. To address the lack of audio-description
paired data in the target language, we combine a TTS model trained on the
target language with a description control model trained on another language,
which maps input text descriptions to the conditional features of the TTS
model. These two models share disentangled timbre and style representations
based on self-supervised learning (SSL), allowing for disentangled voice
control, such as controlling speaking styles while retaining the original
timbre. Furthermore, because the SSL-based timbre and style representations are
language-agnostic, combining the TTS and description control models while
sharing the same embedding space effectively enables cross-lingual control of
voice characteristics. Experiments on English and Japanese TTS demonstrate that
our method achieves high naturalness and controllability for both languages,
even though no Japanese audio-description pairs are used.
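A minimal sketch of the idea described in the abstract follows. All module names, layers, and dimensions below are illustrative placeholders, not the authors' implementation; the point is only that both models read and write one shared timbre/style embedding space, so a description encoder trained in one language can drive a TTS model trained in another.

```python
# Illustrative sketch only: modules, layers, and dimensions are placeholders,
# not the authors' architecture.
import torch
import torch.nn as nn

EMB_DIM = 256  # shared dimensionality of the SSL-derived timbre/style space

class DescriptionControlModel(nn.Module):
    """Maps a text description to (timbre, style) embeddings.

    Trained on audio-description pairs in the source language only.
    """
    def __init__(self, vocab_size: int = 1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB_DIM)
        self.encoder = nn.GRU(EMB_DIM, EMB_DIM, batch_first=True)
        self.to_timbre = nn.Linear(EMB_DIM, EMB_DIM)
        self.to_style = nn.Linear(EMB_DIM, EMB_DIM)

    def forward(self, desc_tokens):
        _, h = self.encoder(self.embed(desc_tokens))
        h = h.squeeze(0)
        return self.to_timbre(h), self.to_style(h)  # disentangled conditions

class TTSModel(nn.Module):
    """Target-language TTS conditioned on the same embedding space.

    Trained on target-language speech; it never sees text descriptions.
    """
    def __init__(self, vocab_size: int = 1000, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB_DIM)
        self.decoder = nn.GRU(3 * EMB_DIM, EMB_DIM, batch_first=True)
        self.to_mel = nn.Linear(EMB_DIM, n_mels)

    def forward(self, phonemes, timbre, style):
        x = self.embed(phonemes)
        cond = torch.cat([timbre, style], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, x.size(1), -1)
        out, _ = self.decoder(torch.cat([x, cond], dim=-1))
        return self.to_mel(out)

# Cross-lingual inference: an English description drives Japanese TTS
# because both models share the language-agnostic embedding space. Keeping
# `timbre` fixed while swapping `style` gives disentangled style control.
control = DescriptionControlModel()          # trained on English pairs
tts = TTSModel()                             # trained on Japanese speech
desc = torch.randint(0, 1000, (1, 12))       # e.g. "a calm, low-pitched voice"
phonemes = torch.randint(0, 1000, (1, 30))   # Japanese phoneme IDs
timbre, style = control(desc)
mel = tts(phonemes, timbre, style)           # (1, 30, 80) mel frames
```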
Related papers
- StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech [13.713209707407712]
StyleSpeech is a novel Text-to-Speech (TTS) system that enhances the naturalness and accuracy of synthesized speech.
Building upon existing TTS technologies, StyleSpeech incorporates a unique Style Decorator structure that enables deep learning models to simultaneously learn style and phoneme features.
LoRA allows efficient adaptation of style features in pre-trained models.
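Since the summary names LoRA, here is a minimal sketch of a LoRA-adapted linear layer, assuming the standard formulation (y = Wx + (alpha/r)·BAx with only A and B trained). This is generic LoRA, not StyleSpeech's actual code.

```python
# Generic LoRA sketch (not StyleSpeech's code): a frozen pre-trained linear
# layer plus a trainable low-rank update, y = W x + (alpha / r) * B A x.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # pre-trained weights stay frozen
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank          # B starts at 0, so training
                                           # begins from the base model
    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# E.g. wrap a hypothetical style-projection layer; only A and B are tuned.
style_proj = LoRALinear(nn.Linear(256, 256))
y = style_proj(torch.randn(4, 256))        # (4, 256)
```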
arXiv Detail & Related papers (2024-08-27T00:37:07Z)
- ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec [50.273832905535485]
We present ControlSpeech, a text-to-speech (TTS) system capable of fully mimicking the speaker's voice and enabling arbitrary control and adjustment of speaking style.
Prior zero-shot TTS models could only mimic the speaker's voice without further control and adjustment, while prior controllable TTS models were unrelated to speaker-specific voice generation.
arXiv Detail & Related papers (2024-06-03T11:15:16Z)
- Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations [12.891344121936902]
Expressive text-to-speech (TTS) aims to synthesize speech with human-like tones, moods, or even artistic attributes.
Recent advancements in TTS empower users with the ability to directly control synthesis style through natural language prompts.
We present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations.
arXiv Detail & Related papers (2023-11-02T14:20:37Z)
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
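A data-flow sketch of such a pipeline, under the structure the summary states (content carried by discrete SSL units, voice identity by separate timbre units); every function below is a toy placeholder, not the paper's models.

```python
# Toy data-flow sketch: all functions are placeholders, not the paper's
# models. Content travels as discrete SSL units; the speaker's voice
# travels separately as discrete timbre units.

def extract_content_units(src_wav):
    # Placeholder for SSL feature extraction + quantization (e.g. k-means).
    return [3, 41, 41, 7, 19]            # discrete content unit IDs

def extract_timbre_units(src_wav):
    # Placeholder for a timbre tokenizer over the same utterance.
    return [12, 12, 5]                   # discrete timbre unit IDs

def translate_units(units):
    # Placeholder for a unit-to-unit translation model (source -> target).
    return [u + 100 for u in units]

def vocode(content_units, timbre_units):
    # Placeholder for a unit vocoder conditioned on timbre units, so the
    # translated speech keeps the source speaker's voice.
    return f"wav(content={content_units}, timbre={timbre_units})"

src_wav = object()                       # stands in for source-language audio
tgt_units = translate_units(extract_content_units(src_wav))
out_wav = vocode(tgt_units, extract_timbre_units(src_wav))
```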
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
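The multi-stage prompt-programming idea can be illustrated with a toy two-stage version; the staging, prompt wording, and the `llm` stand-in below are hypothetical, not the paper's actual prompts.

```python
# Toy two-stage illustration; the real paper's stages and prompts differ,
# and `llm` is a stand-in for any text-generation API.

def llm(prompt: str) -> str:
    # Placeholder response; a real implementation would call a GPT model.
    return "A cheerful young female voice, speaking quickly at a high pitch."

def describe(attrs: dict) -> str:
    # Stage 1: serialize the annotated attribute labels into constraints.
    facts = "; ".join(f"{k}: {v}" for k, v in attrs.items())
    # Stage 2: ask the model to rewrite the constraints as fluent prose.
    return llm(
        "Rewrite the following speech attributes as one natural sentence "
        f"describing the speaking style: {facts}"
    )

caption = describe({"gender": "female", "pitch": "high",
                    "speed": "fast", "emotion": "happy"})
```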
arXiv Detail & Related papers (2023-08-28T09:06:32Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
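A sketch of the pseudo-text view, assuming language-ID tokens prepended to unit sequences (a common convention in multilingual translation; UTUT's actual tokenization may differ):

```python
# Illustrative only: unit vocabularies, language tokens, and the model are
# assumptions, not UTUT's actual setup.
import torch
import torch.nn as nn

LANG = {"<en>": 0, "<ja>": 1, "<fr>": 2}   # language-ID tokens
UNIT_OFFSET = len(LANG)                    # unit IDs start after lang tokens

def make_example(src_units, src_lang, tgt_units, tgt_lang):
    # A translation pair becomes ordinary token sequences ("pseudo-text").
    src = [LANG[src_lang]] + [u + UNIT_OFFSET for u in src_units]
    tgt = [LANG[tgt_lang]] + [u + UNIT_OFFSET for u in tgt_units]
    return torch.tensor([src]), torch.tensor([tgt])

# Any language direction is just a different pair of language tokens.
src, tgt = make_example([5, 9, 9, 2], "<en>", [7, 1, 4], "<ja>")

embed = nn.Embedding(512, 64)
model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)
out = model(embed(src), embed(tgt))        # (1, tgt_len, 64) decoder states
```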
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations [27.157701195636477]
ParrotTTS is a modularized text-to-speech synthesis model.
It can train a multi-speaker variant effectively using transcripts from a single speaker.
It adapts to a new language in low resource setup and generalizes to languages not seen while training the self-supervised backbone.
arXiv Detail & Related papers (2023-03-01T17:23:12Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
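The masking step the summary describes can be sketched directly; the ratios, shapes, and mask token below are illustrative assumptions, not ERNIE-SAT's actual configuration.

```python
# Illustrative masking step: ratios, shapes, and the mask token are assumed.
import torch

def mask_inputs(mel, phonemes, mel_ratio=0.3, ph_ratio=0.15, mask_id=0):
    mel, phonemes = mel.clone(), phonemes.clone()
    # Randomly mask whole spectrogram frames (time steps).
    frame_mask = torch.rand(mel.size(0)) < mel_ratio
    mel[frame_mask] = 0.0
    # Independently mask phoneme tokens.
    token_mask = torch.rand(phonemes.size(0)) < ph_ratio
    phonemes[token_mask] = mask_id
    return mel, phonemes, frame_mask, token_mask

mel = torch.randn(200, 80)               # 200 frames x 80 mel bins
phonemes = torch.randint(1, 100, (40,))  # 40 phoneme IDs; 0 is <mask>
mel_m, ph_m, f_mask, t_mask = mask_inputs(mel, phonemes)
# A joint pretraining loss would reconstruct mel[f_mask] and phonemes[t_mask].
```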
arXiv Detail & Related papers (2022-10-31T12:44:53Z)
- Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation [11.336431583289382]
This paper presents a method for end-to-end cross-lingual text-to-speech.
It aims to preserve the target language's pronunciation regardless of the original speaker's language.
arXiv Detail & Related papers (2022-10-31T12:44:53Z)