Multilingual Multiaccented Multispeaker TTS with RADTTS
- URL: http://arxiv.org/abs/2301.10335v1
- Date: Tue, 24 Jan 2023 22:39:04 GMT
- Title: Multilingual Multiaccented Multispeaker TTS with RADTTS
- Authors: Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos,
Siddharth Gururani, Bryan Catanzaro
- Abstract summary: We present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS.
We demonstrate the ability to control the synthesized accent of any speaker in an open-source dataset comprising 7 accents.
- Score: 21.234787964238645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We work to create a multilingual speech synthesis system that can
generate speech with the proper accent while retaining the characteristics of
an individual voice. This is challenging because bilingual training data is
expensive to obtain across multiple languages, and its absence creates strong
correlations that entangle speaker, language, and accent, which in turn leads
to poor transfer capabilities. To overcome this, we present a multilingual,
multiaccented, multispeaker speech synthesis model based on RADTTS with
explicit control over accent, language, speaker, and fine-grained $F_0$ and
energy features. Our proposed model does not rely on bilingual training data.
We demonstrate the ability to control the synthesized accent of any speaker in
an open-source dataset comprising 7 accents. Human subjective evaluation
demonstrates that our model retains a speaker's voice and accent quality
better than controlled baselines while synthesizing fluent speech in all
target languages and accents in our dataset.
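A rough intuition for the disentanglement the abstract describes: if speaker, language, and accent each enter the model as an independent conditioning signal, any one of them can be swapped at inference time without touching the others. The PyTorch sketch below illustrates that idea only; the module, dimensions, and fusion scheme are hypothetical, not the authors' RADTTS implementation.

```python
import torch
import torch.nn as nn

class FactoredConditioning(nn.Module):
    """Illustrative only: independent embedding tables for speaker,
    language, and accent, so each attribute can be swapped at
    inference time without touching the others."""

    def __init__(self, n_speakers, n_languages, n_accents, dim=128):
        super().__init__()
        self.speaker = nn.Embedding(n_speakers, dim)
        self.language = nn.Embedding(n_languages, dim)
        self.accent = nn.Embedding(n_accents, dim)
        # Fine-grained F0 and energy enter as per-frame scalars.
        self.prosody = nn.Linear(2, dim)

    def forward(self, spk, lang, accent, f0, energy):
        # f0, energy: (batch, frames) -> per-frame prosody features
        prosody = self.prosody(torch.stack([f0, energy], dim=-1))
        # Global attribute embeddings, broadcast over frames.
        attrs = (self.speaker(spk) + self.language(lang)
                 + self.accent(accent)).unsqueeze(1)
        return attrs + prosody  # (batch, frames, dim)

# Example: speaker 3's voice, language 1, accent 5 (any combination works).
cond = FactoredConditioning(n_speakers=10, n_languages=3, n_accents=7)
f0, energy = torch.rand(1, 100), torch.rand(1, 100)
out = cond(torch.tensor([3]), torch.tensor([1]), torch.tensor([5]), f0, energy)
print(out.shape)  # torch.Size([1, 100, 128])
```

In RADTTS itself such signals would condition a normalizing-flow decoder; the sketch only shows why factored conditioning lets any accent be paired with any speaker.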
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that had to be trained separately for each non-native accent.
This paper presents a promising AC model that converts many accents into a native accent to overcome these issues.
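The enabling idea, per the summary, is to synthesize the parallel data that real corpora lack: a controllable accented TTS utters the same text in each non-native accent and in a native accent, yielding training pairs for a many-to-one converter. A schematic sketch, where `accented_tts` is a hypothetical stand-in for any such controllable TTS:

```python
import numpy as np

def accented_tts(text: str, accent: str, seed: int = 0) -> np.ndarray:
    """Hypothetical stand-in for a controllable accented TTS:
    returns a fake waveform so the pipeline below runs end to end."""
    rng = np.random.default_rng(hash((text, accent, seed)) % 2**32)
    return rng.standard_normal(16000)

def build_parallel_pairs(texts, source_accents, target_accent="native"):
    """Synthesize (accented, native) waveform pairs for the same text;
    such pairs supervise a many-to-one accent conversion model."""
    pairs = []
    for text in texts:
        native = accented_tts(text, target_accent)
        for accent in source_accents:
            pairs.append((accented_tts(text, accent), native))
    return pairs

pairs = build_parallel_pairs(["hello world"], ["accent_a", "accent_b"])
print(len(pairs))  # 2 pairs, one per non-native accent
```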
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction.
Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese.
We propose an approach that enhances SER performance in low-resource languages by leveraging data from high-resource languages.
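One plausible reading of the title's "bootstrapping data selection", sketched below purely under that assumption (the S2ST and SER scorers are toy stand-ins, not the paper's systems): translate labeled high-resource speech into the target language, then keep only the translations the current SER model scores confidently.

```python
# Schematic reading of "bootstrapping data selection"; s2st and
# ser_model are hypothetical stand-ins, not the paper's systems.
def bootstrap_ser_data(labeled_src, s2st, ser_model, threshold=0.8):
    """Translate labeled high-resource speech into the target language,
    then keep only translations the SER model scores confidently."""
    selected = []
    for wav, emotion in labeled_src:
        translated = s2st(wav)                   # speech-to-speech translation
        confidence = ser_model(translated, emotion)
        if confidence >= threshold:
            selected.append((translated, emotion))
    return selected

# Toy stand-ins so the sketch runs:
data = [([0.1, 0.2], "happy"), ([0.3, 0.4], "sad")]
picked = bootstrap_ser_data(
    data,
    s2st=lambda w: [x * 2 for x in w],
    ser_model=lambda w, e: 0.9 if e == "happy" else 0.5,
)
print(picked)  # only the confidently scored "happy" pair survives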
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations [27.157701195636477]
ParrotTTS is a modularized text-to-speech synthesis model.
It can train a multi-speaker variant effectively using transcripts from a single speaker.
It adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone.
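One way such modularization can decouple transcripts from speakers, sketched here as an assumption rather than the paper's verified design: a text-to-unit module (the only part needing transcripts, hence one transcribed speaker suffices) feeds a unit-to-speech module trained on untranscribed multi-speaker audio.

```python
# Illustrative decomposition only; module names are hypothetical.
# Stage 1 needs text, so single-speaker transcripts suffice;
# Stage 2 learns speakers from untranscribed audio alone.
from dataclasses import dataclass

@dataclass
class ModularTTS:
    text_to_units: callable    # trained on one transcribed speaker
    units_to_speech: callable  # trained on untranscribed multi-speaker audio

    def synthesize(self, text, speaker_id):
        units = self.text_to_units(text)  # speaker-agnostic units
        return self.units_to_speech(units, speaker_id)

# Toy stand-ins so the sketch runs:
tts = ModularTTS(
    text_to_units=lambda t: [ord(c) % 100 for c in t],
    units_to_speech=lambda u, spk: [x + spk for x in u],
)
print(tts.synthesize("hi", speaker_id=7))
```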
arXiv Detail & Related papers (2023-03-01T17:23:12Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
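The masking step the summary names is easy to picture concretely. A minimal sketch, with illustrative mask ratios and shapes rather than the paper's settings:

```python
import torch

def joint_mask(spectrogram, phonemes, mel_ratio=0.3, phone_ratio=0.15,
               mask_phone_id=0):
    """Randomly mask spectrogram frames and phoneme tokens so a
    joint speech-text model must reconstruct each from the other."""
    spec = spectrogram.clone()
    frame_mask = torch.rand(spec.shape[0]) < mel_ratio
    spec[frame_mask] = 0.0                 # zero out masked frames
    phones = phonemes.clone()
    phone_mask = torch.rand(phones.shape[0]) < phone_ratio
    phones[phone_mask] = mask_phone_id     # replace with a [MASK] id
    return spec, phones, frame_mask, phone_mask

mel = torch.randn(200, 80)         # (frames, mel bins)
ph = torch.randint(1, 50, (40,))   # phoneme ids, 0 reserved for [MASK]
spec, phones, fm, pm = joint_mask(mel, ph)
print(fm.sum().item(), pm.sum().item())  # counts of masked frames/tokens
```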
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Low-Resource Multilingual and Zero-Shot Multispeaker TTS [25.707717591185386]
We show that a system can learn to speak a new language using just 5 minutes of training data.
We show the success of our proposed approach in terms of intelligibility, naturalness, and similarity to the target speaker.
arXiv Detail & Related papers (2022-10-21T20:03:37Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide between high- and low-resource languages.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks for some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
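In spirit, this amounts to ranking candidate source languages by the similarity of their acoustic embeddings to the target's, rather than by language family. A schematic sketch with random placeholder embeddings (in the paper they would be derived from acoustic features):

```python
import numpy as np

def best_transfer_sources(lang_embs: dict, target: str, k: int = 3):
    """Rank source languages by cosine similarity of their acoustic
    embeddings to the target language's embedding."""
    t = lang_embs[target]
    t = t / np.linalg.norm(t)
    scores = {
        lang: float(emb @ t / np.linalg.norm(emb))
        for lang, emb in lang_embs.items() if lang != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

rng = np.random.default_rng(0)
embs = {lang: rng.standard_normal(64)
        for lang in ["en", "de", "hi", "ta", "sw", "yo"]}
print(best_transfer_sources(embs, target="sw"))
```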
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension of Tacotron2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers.
arXiv Detail & Related papers (2020-05-21T03:03:34Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovers that even phones unique to a single language can benefit greatly from adding training data in other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We find that the model benefits from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
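A toy illustration of how discrete speech representations let a TTS model learn from untranscribed audio: quantizing acoustic frames against a codebook yields discrete targets that play the role of transcripts. The codebook below is random; in practice it would be learned:

```python
import torch

def quantize(frames: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each acoustic frame to its nearest codebook entry, giving
    a discrete 'pseudo transcript' for untranscribed audio."""
    # frames: (T, D), codebook: (K, D) -> indices: (T,)
    dists = torch.cdist(frames, codebook)  # pairwise distances (T, K)
    return dists.argmin(dim=1)

codebook = torch.randn(256, 80)      # K discrete units
audio_frames = torch.randn(120, 80)  # untranscribed utterance
units = quantize(audio_frames, codebook)
print(units[:10])  # discrete ids usable as decoder targets
```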
arXiv Detail & Related papers (2020-05-16T15:47:11Z)