Towards Natural Bilingual and Code-Switched Speech Synthesis Based on
Mix of Monolingual Recordings and Cross-Lingual Voice Conversion
- URL: http://arxiv.org/abs/2010.08136v1
- Date: Fri, 16 Oct 2020 03:51:00 GMT
- Title: Towards Natural Bilingual and Code-Switched Speech Synthesis Based on
Mix of Monolingual Recordings and Cross-Lingual Voice Conversion
- Authors: Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, Bin Ma
- Abstract summary: It is not easy to obtain a bilingual corpus from a speaker who achieves native-level fluency in both languages.
A Tacotron2-based cross-lingual voice conversion system is employed to generate the Mandarin speaker's English speech and the English speaker's Mandarin speech.
The obtained bilingual data are then augmented with code-switched utterances synthesized using a Transformer model.
- Score: 28.830575877307176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent state-of-the-art neural text-to-speech (TTS) synthesis models have
dramatically improved intelligibility and naturalness of generated speech from
text. However, building a good bilingual or code-switched TTS for a particular
voice is still a challenge. The main reason is that it is not easy to obtain a
bilingual corpus from a speaker who achieves native-level fluency in both
languages. In this paper, we explore the use of Mandarin speech recordings from
a Mandarin speaker, and English speech recordings from another English speaker
to build high-quality bilingual and code-switched TTS for both speakers. A
Tacotron2-based cross-lingual voice conversion system is employed to generate
the Mandarin speaker's English speech and the English speaker's Mandarin
speech, which show good naturalness and speaker similarity. The obtained
bilingual data are then augmented with code-switched utterances synthesized
using a Transformer model. With these data, three neural TTS models --
Tacotron2, Transformer and FastSpeech -- are applied for building bilingual
and code-switched TTS. Subjective evaluation results show that all three
systems can produce (near-)native-level speech in both languages for each
of the speakers.
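Read as a recipe, the abstract describes a three-step data-building pipeline. The Python sketch below is only an illustration of that flow under stated assumptions: every function, corpus name, and data shape in it is a hypothetical placeholder, not code released with the paper.

```python
# Schematic sketch of the data-building recipe from the abstract. All
# functions below are hypothetical stand-ins for the Tacotron2-based
# voice conversion, Transformer-based code-switched synthesis, and TTS
# training the paper describes; nothing here is the authors' code.

def convert_voice(utterances, target_speaker):
    # Placeholder for Tacotron2-based cross-lingual voice conversion:
    # keep each utterance's text, relabel it with the target speaker.
    return [(text, target_speaker) for text, _ in utterances]

def synthesize_code_switched(texts, speaker):
    # Placeholder for code-switched synthesis with a Transformer model.
    return [(text, speaker) for text in texts]

# Monolingual recordings, modeled here as (text, speaker) pairs.
zh_corpus = [("ni hao", "zh_speaker")]
en_corpus = [("hello", "en_speaker")]
cs_texts = ["hello, ni hao"]  # mixed-language text for augmentation

# Step 1: cross-lingual VC gives each speaker the other language.
zh_speaker_en = convert_voice(en_corpus, "zh_speaker")
en_speaker_zh = convert_voice(zh_corpus, "en_speaker")

# Step 2: augment with synthetic code-switched utterances.
augmented = (
    zh_corpus + en_corpus + zh_speaker_en + en_speaker_zh
    + synthesize_code_switched(cs_texts, "zh_speaker")
    + synthesize_code_switched(cs_texts, "en_speaker")
)

# Step 3: the augmented set would then train each evaluated architecture
# (Tacotron2, Transformer, FastSpeech) as a bilingual/code-switched TTS.
print(augmented)
```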
Related papers
- MulliVC: Multi-lingual Voice Conversion With Cycle Consistency [75.59590240034261]
MulliVC is a novel voice conversion system that converts only the timbre while keeping the original content and source-language prosody, without requiring multilingual paired data.
Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts.
arXiv Detail & Related papers (2024-08-08T18:12:51Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z) - PolyVoice: Language Models for Speech to Speech Translation [50.31000706309143]
PolyVoice is a language model-based framework for speech-to-speech translation (S2ST).
We use discretized speech units, which are generated in a fully unsupervised way (see the toy unit-discretization sketch after this list).
For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
arXiv Detail & Related papers (2023-06-05T15:53:15Z)
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors (a toy RVQ sketch follows this list).
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Latent linguistic embedding for cross-lingual text-to-speech and voice conversion [44.700803634034486]
Cross-lingual speech generation is the scenario in which speech utterances are generated with the voices of target speakers in a language not spoken by them originally.
We show that our method not only creates cross-lingual VC with high speaker similarity but also can be seamlessly used for cross-lingual TTS without having to perform any extra steps.
arXiv Detail & Related papers (2020-10-08T01:25:07Z)
- Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension on Tacotron2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers.
arXiv Detail & Related papers (2020-05-21T03:03:34Z)
- Generating Multilingual Voices Using Speaker Space Translation Based on Bilingual Speaker Data [15.114637085644057]
We show that a simple transform in speaker space can be used to control the degree of accent of a synthetic voice in a language.
The same transform can be applied even to monolingual speakers; a toy speaker-space-shift sketch follows this list.
arXiv Detail & Related papers (2020-04-10T10:01:53Z)
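As referenced in the PolyVoice entry above, here is a toy sketch of how discretized speech units are commonly obtained without supervision: frame-level self-supervised features are clustered with k-means, and cluster ids serve as pseudo-text. The random features and all sizes below are illustrative assumptions, not PolyVoice's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy unit discretization: cluster frame-level features (random stand-ins
# for real self-supervised encoder outputs, e.g. HuBERT activations) and
# use the cluster index of each frame as a discrete "speech unit".
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 768))  # (frames, feature_dim), illustrative

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
units = kmeans.predict(features)         # one integer unit per frame

# Unit-based pipelines typically collapse consecutive duplicates so the
# sequence reads more like pseudo-text than like a frame-rate signal.
deduped = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(deduped[:20])
```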
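The NaturalSpeech 2 entry mentions residual vector quantizers; the toy NumPy sketch below shows only the core mechanism, where each codebook quantizes the residual left by the previous stage. The random codebooks are purely illustrative, since a real neural codec learns them jointly with its encoder and decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
num_stages, codebook_size, dim = 4, 256, 16
# Illustrative random codebooks; a real codec learns these.
codebooks = rng.normal(size=(num_stages, codebook_size, dim))

def rvq_encode(x, codebooks):
    # Each stage picks the nearest code, then passes on the residual.
    residual, codes = x, []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the chosen code vectors.
    return sum(cb[i] for cb, i in zip(codebooks, codes))

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print(codes, float(np.linalg.norm(x - x_hat)))  # error shrinks with more stages
```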
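Finally, for the speaker space translation entry: a toy sketch of the general idea, estimating a language/accent direction from paired bilingual speaker embeddings and applying a scaled version of it to any speaker. All embeddings here are random placeholders, and this is a guess at the shape of such a transform, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_bilingual = 64, 10

# Paired embeddings of the same bilingual speakers in languages A and B
# (random placeholders for a multilingual TTS model's speaker vectors).
emb_lang_a = rng.normal(size=(n_bilingual, dim))
emb_lang_b = emb_lang_a + 0.5 * rng.normal(size=(n_bilingual, dim))

# The mean displacement approximates a direction in speaker space that
# moves a voice toward language B's native rendering.
accent_direction = (emb_lang_b - emb_lang_a).mean(axis=0)

def shift_accent(speaker_embedding, alpha):
    # alpha = 0 keeps the original voice; alpha = 1 applies the full
    # shift; intermediate values control the degree of accent.
    return speaker_embedding + alpha * accent_direction

monolingual_speaker = rng.normal(size=dim)  # works for monolingual voices too
mildly_accented = shift_accent(monolingual_speaker, 0.3)
print(mildly_accented[:5])
```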