Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control
- URL: http://arxiv.org/abs/2409.17452v1
- Date: Thu, 26 Sep 2024 01:08:09 GMT
- Title: Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control
- Authors: Ryuichi Yamamoto, Yuma Shirahata, Masaya Kawamura, Kentaro Tachibana
- Abstract summary: We propose a novel controllable text-to-speech (TTS) method with cross-lingual control capability.
We combine a TTS model trained on the target language with a description control model trained on another language, which maps input text descriptions to the conditional features of the TTS model.
Experiments on English and Japanese TTS demonstrate that our method achieves high naturalness and controllability for both languages.
- Score: 14.145510487599932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel description-based controllable text-to-speech (TTS) method
with cross-lingual control capability. To address the lack of audio-description
paired data in the target language, we combine a TTS model trained on the
target language with a description control model trained on another language,
which maps input text descriptions to the conditional features of the TTS
model. These two models share disentangled timbre and style representations
based on self-supervised learning (SSL), allowing for disentangled voice
control, such as controlling speaking styles while retaining the original
timbre. Furthermore, because the SSL-based timbre and style representations are
language-agnostic, combining the TTS and description control models while
sharing the same embedding space effectively enables cross-lingual control of
voice characteristics. Experiments on English and Japanese TTS demonstrate that
our method achieves high naturalness and controllability for both languages,
even though no Japanese audio-description pairs are used.
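
The division of labor described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `VoiceCondition`, `describe_to_style`, `control_style`, and `synthesize` are hypothetical, the vector sizes are arbitrary, and the character hash merely stands in for a learned description control model. The sketch only shows the interface idea, namely that a description changes the style features while the timbre features pass through unchanged to the TTS model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VoiceCondition:
    """Conditional features for the TTS model (hypothetical layout)."""
    timbre: List[float]  # who is speaking
    style: List[float]   # how it is spoken


def describe_to_style(description: str, dim: int) -> List[float]:
    """Toy stand-in for the description control model.

    A real model would be learned from audio-description pairs; this
    deterministic character hash only illustrates the interface.
    """
    vec = [0.0] * dim
    for i, ch in enumerate(description.lower()):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def control_style(reference: VoiceCondition, description: str) -> VoiceCondition:
    """Swap in a description-derived style while keeping the reference timbre."""
    new_style = describe_to_style(description, len(reference.style))
    return VoiceCondition(timbre=list(reference.timbre), style=new_style)


def synthesize(text: str, cond: VoiceCondition) -> str:
    """Placeholder for the target-language TTS model; returns a summary string."""
    return f"speech(text={text!r}, timbre={cond.timbre}, style={cond.style})"


# Assumed example values: an English description steering Japanese synthesis.
reference = VoiceCondition(timbre=[0.1, 0.2, 0.3], style=[0.0, 0.0, 0.0, 0.0])
controlled = control_style(reference, "A calm, low-pitched voice")
output = synthesize("こんにちは", controlled)
```

Because both components read and write the same `VoiceCondition` fields, the description model and the TTS model can in principle be trained on different languages, which is the property the abstract attributes to the shared, language-agnostic embedding space.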
 
      
Related papers
- Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation [18.89091877062589]
 LanStyleTTS is a non-autoregressive, language-aware style adaptive TTS framework.
It supports a unified multilingual TTS model capable of producing accurate and high-quality speech without the need to train language-specific models.
 arXiv  Detail & Related papers  (2025-04-11T06:12:57Z)
- StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech [13.713209707407712]
 StyleSpeech is a novel Text-to-Speech (TTS) system that enhances the naturalness and accuracy of synthesized speech.
Building upon existing TTS technologies, StyleSpeech incorporates a unique Style Decorator structure that enables deep learning models to simultaneously learn style and phoneme features.
LoRA allows efficient adaptation of style features in pre-trained models.
 arXiv  Detail & Related papers  (2024-08-27T00:37:07Z)
- ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec [50.273832905535485]
 We present ControlSpeech, a text-to-speech (TTS) system capable of fully mimicking the speaker's voice and enabling arbitrary control and adjustment of speaking style.
Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and adjustment capabilities or were unrelated to speaker-specific voice generation.
 arXiv  Detail & Related papers  (2024-06-03T11:15:16Z)
- Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations [12.891344121936902]
 Expressive text-to-speech (TTS) aims to synthesize speech with human-like tones, moods, or even artistic attributes.
Recent advancements in TTS empower users with the ability to directly control synthesis style through natural language prompts.
We present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations.
 arXiv  Detail & Related papers  (2023-11-02T14:20:37Z)
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
 Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
 arXiv  Detail & Related papers  (2023-09-14T09:52:08Z)
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
 We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
 arXiv  Detail & Related papers  (2023-08-28T09:06:32Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
 This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
 arXiv  Detail & Related papers  (2023-08-03T15:47:04Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
 Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
 arXiv  Detail & Related papers  (2023-06-06T08:54:49Z)
- ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations [27.157701195636477]
 ParrotTTS is a modularized text-to-speech synthesis model.
It can train a multi-speaker variant effectively using transcripts from a single speaker.
It adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone.
 arXiv  Detail & Related papers  (2023-03-01T17:23:12Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
 We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
 arXiv  Detail & Related papers  (2022-11-07T13:35:16Z)
- Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation [11.336431583289382]
 This paper presents a method for end-to-end cross-lingual text-to-speech.
It aims to preserve the target language's pronunciation regardless of the original speaker's language.
 arXiv  Detail & Related papers  (2022-10-31T12:44:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     