Improving Cross-lingual Speech Synthesis with Triplet Training Scheme
- URL: http://arxiv.org/abs/2202.10729v1
- Date: Tue, 22 Feb 2022 08:40:43 GMT
- Title: Improving Cross-lingual Speech Synthesis with Triplet Training Scheme
- Authors: Jianhao Ye, Hongbin Zhou, Zhiba Su, Wendi He, Kaimeng Ren, Lin Li, Heng Lu
- Abstract summary: A triplet training scheme is proposed to enhance cross-lingual pronunciation.
The proposed method brings a significant improvement in both the intelligibility and the naturalness of the synthesized cross-lingual speech.
- Score: 5.470211567548067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in cross-lingual text-to-speech (TTS) have made it possible to
synthesize speech in a language foreign to a monolingual speaker. However,
there is still a large gap between the pronunciation of generated cross-lingual
speech and that of native speakers in terms of naturalness and intelligibility.
In this paper, a triplet training scheme is proposed to enhance the
cross-lingual pronunciation by allowing previously unseen content and speaker
combinations to be seen during training. The proposed method introduces an extra
fine-tuning stage with triplet loss during training, which efficiently draws the
pronunciation of the synthesized foreign speech closer to that of the native
anchor speaker, while preserving the non-native speaker's timbre. Experiments
are conducted based on a state-of-the-art baseline cross-lingual TTS system and
its enhanced variants. Objective and subjective evaluations all show that the
proposed method significantly improves both the intelligibility and the
naturalness of the synthesized cross-lingual speech.
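The abstract names a triplet loss but does not give its exact form. As a rough illustration only, the standard triplet margin loss can be sketched as below. The function names, the Euclidean distance, the margin value, and the toy vectors are all assumptions made for this sketch. Which embeddings play the anchor, positive, and negative roles in the paper is not stated in the abstract.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(d(a, p) - d(a, n) + margin, 0).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`; otherwise it penalizes the shortfall.
    The distance function and margin are illustrative choices, not the
    paper's settings.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


# Hypothetical 2-D feature vectors, used only to exercise the formula.
anchor = [0.0, 0.0]
far_point = [3.0, 4.0]   # distance 5 from the anchor
near_point = [0.0, 1.0]  # distance 1 from the anchor

# Positive is far and negative is near: the constraint is violated.
violated = triplet_margin_loss(anchor, far_point, near_point)

# Positive is near and negative is far: the constraint is satisfied.
satisfied = triplet_margin_loss(anchor, near_point, far_point)
```

In this toy case the violated triplet yields a positive loss, while the satisfied one yields zero, so only triplets that break the margin contribute gradient during fine-tuning.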
Related papers
- MulliVC: Multi-lingual Voice Conversion With Cycle Consistency [75.59590240034261]
MulliVC is a novel voice conversion system that only converts timbre and keeps original content and source language prosody without multi-lingual paired data.
Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts.
arXiv Detail & Related papers (2024-08-08T18:12:51Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech [0.3277163122167433]
SANE-TTS is a stable and natural end-to-end multilingual TTS model.
We introduce speaker regularization loss that improves speech naturalness during cross-lingual synthesis.
Our model generates speeches with moderate rhythm regardless of source speaker in cross-lingual synthesis.
arXiv Detail & Related papers (2022-06-24T07:53:05Z)
- Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training [6.256271702518489]
In cross-lingual speech synthesis, the speech in various languages can be synthesized for a monoglot speaker.
This paper studies a multi-task learning framework to improve the cross-lingual speaker similarity.
arXiv Detail & Related papers (2022-01-20T12:02:58Z)
- Cross-lingual Low Resource Speaker Adaptation Using Phonological Features [2.8080708404213373]
We train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages.
With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature.
arXiv Detail & Related papers (2021-11-17T12:33:42Z)
- Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion [28.830575877307176]
It is not easy to obtain a bilingual corpus from a speaker who achieves native-level fluency in both languages.
A Tacotron2-based cross-lingual voice conversion system is employed to generate the Mandarin speaker's English speech and the English speaker's Mandarin speech.
The obtained bilingual data are then augmented with code-switched utterances synthesized using a Transformer model.
arXiv Detail & Related papers (2020-10-16T03:51:00Z)
- Latent linguistic embedding for cross-lingual text-to-speech and voice conversion [44.700803634034486]
Cross-lingual speech generation is the scenario in which speech utterances are generated with the voices of target speakers in a language not spoken by them originally.
We show that our method not only creates cross-lingual VC with high speaker similarity but also can be seamlessly used for cross-lingual TTS without having to perform any extra steps.
arXiv Detail & Related papers (2020-10-08T01:25:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.