Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT
- URL: http://arxiv.org/abs/2409.07265v1
- Date: Wed, 11 Sep 2024 13:40:27 GMT
- Title: Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT
- Authors: Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari,
- Abstract summary: Cross-dialect text-to-speech (CD-TTS) is a task to synthesize learned speakers' voices in non-native dialects.
We present a novel TTS model comprising three sub-modules to perform competitively at this task.
- Score: 29.167336994990542
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We explore cross-dialect text-to-speech (CD-TTS), a task to synthesize learned speakers' voices in non-native dialects, especially in pitch-accent languages. CD-TTS is important for developing voice agents that naturally communicate with people across regions. We present a novel TTS model comprising three sub-modules to perform competitively at this task. We first train a backbone TTS model to synthesize dialect speech from a text conditioned on phoneme-level accent latent variables (ALVs) extracted from speech by a reference encoder. Then, we train an ALV predictor to predict ALVs tailored to a target dialect from input text leveraging our novel multi-dialect phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the effectiveness of our model by comparing it with a baseline derived from conventional dialect TTS methods. The results show that our model improves the dialectal naturalness of synthetic speech in CD-TTS.
Related papers
- Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for
Robust Polyglot Text-To-Speech [6.243356997302935]
We introduce a framework for cross-lingual speech synthesis, which involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model.
In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker.
In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, which are then used to train a single-speaker acoustic model.
arXiv Detail & Related papers (2023-09-15T09:03:14Z) - DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech [30.110058338155675]
Cross-lingual text-to-speech (CTTS) is still far from satisfactory as it is difficult to accurately retain the speaker timbres.
We propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style.
By combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis.
arXiv Detail & Related papers (2023-06-25T06:46:36Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - Modelling low-resource accents without accent-specific TTS frontend [4.185844990558149]
This work focuses on modelling a speaker's accent that does not have a dedicated text-to-speech (TTS)
We propose an approach whereby we first augment the target accent data to sound like the donor voice via voice conversion.
We then train a multi-speaker multi-accent TTS model on the combination of recordings and synthetic data, to generate the target accent.
arXiv Detail & Related papers (2023-01-11T18:00:29Z) - Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text to speech synthesis (TTS)
Specifically, we train a neural language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio model.
Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z) - Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z) - ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z) - Cross-lingual Text-To-Speech with Flow-based Voice Conversion for
Improved Pronunciation [11.336431583289382]
This paper presents a method for end-to-end cross-lingual text-to-speech.
It aims to preserve the target language's pronunciation regardless of the original speaker's language.
arXiv Detail & Related papers (2022-10-31T12:44:53Z) - Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding [55.989376102986654]
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech problem under the few-shot setting.
We propose a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space.
arXiv Detail & Related papers (2022-06-27T11:24:40Z) - Transfer Learning Framework for Low-Resource Text-to-Speech using a
Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large scale text labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z) - Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis
Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS)
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.