Generating Multilingual Voices Using Speaker Space Translation Based on
Bilingual Speaker Data
- URL: http://arxiv.org/abs/2004.04972v1
- Date: Fri, 10 Apr 2020 10:01:53 GMT
- Title: Generating Multilingual Voices Using Speaker Space Translation Based on
Bilingual Speaker Data
- Authors: Soumi Maiti, Erik Marchi, Alistair Conkie
- Abstract summary: We show that a simple transform in speaker space can be used to control the degree of accent of a synthetic voice in a language.
The same transform can be applied even to monolingual speakers.
- Score: 15.114637085644057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present progress towards bilingual Text-to-Speech that is able
to transform a monolingual voice to speak a second language while preserving
the speaker's voice quality. We demonstrate that a bilingual speaker embedding
space contains a separate distribution for each language, and that a simple
transform in the speaker space generated by the speaker embedding can be used
to control the degree of accent of a synthetic voice in a language. The same
transform can be applied even to monolingual speakers.
In our experiments, speaker data from an English-Spanish (Mexican) bilingual
speaker was used, and the goal was to enable English speakers to speak Spanish
and Spanish speakers to speak English. We found that the simple transform was
sufficient to convert a voice from one language to the other with a high
degree of naturalness. In one case the transformed voice outperformed a
native-language voice in listening tests. Experiments further indicated that
the transform preserved many of the characteristics of the original voice. The
degree of accent can be controlled, and naturalness is relatively consistent
across a range of accent values.
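One natural reading of the transform is a translation in speaker-embedding
space: e' = e + alpha * (mu_target - mu_source). Below is a minimal sketch of
that idea, assuming the per-language distributions are summarized by their
centroids; the centroid formulation, dimensions, and all names are
illustrative assumptions, not details taken from the paper.

    import numpy as np

    def accent_transform(embedding, mu_source, mu_target, alpha):
        """Translate a speaker embedding from the source-language region of
        speaker space toward the target-language region.

        alpha = 0.0 leaves the embedding unchanged (strongest accent);
        alpha = 1.0 applies the full translation (closest to native);
        values in between set the degree of accent.
        """
        return embedding + alpha * (mu_target - mu_source)

    # Toy usage: mu_en / mu_es stand in for centroids of the bilingual
    # speaker's English and Spanish utterance embeddings (256-dim here).
    rng = np.random.default_rng(0)
    mu_en = rng.normal(size=256)
    mu_es = rng.normal(0.5, 1.0, size=256)
    english_speaker = rng.normal(size=256)
    half_accented_spanish = accent_transform(english_speaker, mu_en, mu_es, 0.5)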
Related papers
- MulliVC: Multi-lingual Voice Conversion With Cycle Consistency [75.59590240034261]
MulliVC is a novel voice conversion system that converts only the timbre while keeping the original content and source-language prosody, without requiring multilingual paired data.
Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts.
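The cycle-consistency idea can be sketched generically: convert an utterance
to the target speaker's timbre and back, then penalize the round-trip error,
which removes the need for parallel recordings. The sketch below shows that
generic objective under the assumption of a mel-to-mel converter network
(`convert` is a placeholder); it is not MulliVC's exact training procedure.

    import torch.nn.functional as F

    def cycle_consistency_loss(convert, mel_a, spk_a, spk_b):
        """Round-trip speaker A's mel-spectrogram through speaker B's timbre
        and back; the L1 reconstruction error supervises training without
        any paired bilingual data."""
        mel_ab = convert(mel_a, target_speaker=spk_b)    # A's speech, B's timbre
        mel_aba = convert(mel_ab, target_speaker=spk_a)  # converted back to A
        return F.l1_loss(mel_aba, mel_a)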
arXiv Detail & Related papers (2024-08-08T18:12:51Z) - Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z) - PolyVoice: Language Models for Speech to Speech Translation [50.31000706309143]
PolyVoice is a language model-based framework for speech-to-speech translation (S2ST).
We use discretized speech units, which are generated in a fully unsupervised way.
For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
arXiv Detail & Related papers (2023-06-05T15:53:15Z) - Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec
Language Modeling [92.55131711064935]
We propose VALL-E X, a cross-lingual neural codec language model for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z) - Multilingual Multiaccented Multispeaker TTS with RADTTS [21.234787964238645]
We present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS.
We demonstrate the ability to control the synthesized accent for any speaker in an open-source dataset comprising 7 accents.
arXiv Detail & Related papers (2023-01-24T22:39:04Z) - Voice-preserving Zero-shot Multiple Accent Conversion [14.218374374305421]
An accent conversion system changes a speaker's accent but preserves that speaker's voice identity.
We use adversarial learning to disentangle accent-dependent features while retaining other acoustic characteristics.
Our model generates audio that sounds closer to the target accent while still sounding like the original speaker.
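Adversarial disentanglement of this kind is commonly implemented with a
gradient-reversal layer: an auxiliary accent classifier learns to predict the
accent from encoder features, while reversed gradients train the encoder to
strip accent cues. The sketch below shows that generic mechanism as an
assumption; it is not necessarily this paper's exact setup.

    import torch

    class GradReverse(torch.autograd.Function):
        """Identity in the forward pass; negates (and scales) gradients in
        the backward pass, so the encoder upstream learns features the
        accent classifier cannot exploit."""

        @staticmethod
        def forward(ctx, x, lamb):
            ctx.lamb = lamb
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lamb * grad_output, None

    def accent_adversary_logits(features, accent_classifier, lamb=1.0):
        # The classifier trains normally on `features`; the encoder that
        # produced `features` receives reversed gradients.
        return accent_classifier(GradReverse.apply(features, lamb))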
arXiv Detail & Related papers (2022-11-23T19:51:16Z) - ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
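A minimal sketch of the random masking step, assuming independent timestep
masking over spectrogram frames and phoneme IDs; the masking ratios and names
are illustrative, not the paper's hyperparameters.

    import torch

    def random_mask(x, mask_ratio, mask_value=0.0):
        """Mask a random subset of timesteps.
        x: (batch, time, ...) spectrogram frames, or (batch, time) phoneme IDs.
        Returns the masked copy and the boolean mask (True = masked)."""
        mask = torch.rand(x.shape[:2]) < mask_ratio
        x = x.clone()
        x[mask] = mask_value
        return x, mask

    # e.g. mask spectrogram frames and phoneme tokens at different rates
    spec, spec_mask = random_mask(torch.randn(2, 100, 80), mask_ratio=0.5)
    phones, phone_mask = random_mask(torch.randint(1, 60, (2, 30)),
                                     mask_ratio=0.15, mask_value=0)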
arXiv Detail & Related papers (2022-11-07T13:35:16Z) - Towards Natural Bilingual and Code-Switched Speech Synthesis Based on
Mix of Monolingual Recordings and Cross-Lingual Voice Conversion [28.830575877307176]
It is not easy to obtain a bilingual corpus from a speaker who achieves native-level fluency in both languages.
A Tacotron2-based cross-lingual voice conversion system is employed to generate the Mandarin speaker's English speech and the English speaker's Mandarin speech.
The obtained bilingual data are then augmented with code-switched utterances synthesized using a Transformer model.
arXiv Detail & Related papers (2020-10-16T03:51:00Z) - Latent linguistic embedding for cross-lingual text-to-speech and voice
conversion [44.700803634034486]
Cross-lingual speech generation is the scenario in which speech utterances are generated in the voices of target speakers for a language those speakers do not originally speak.
We show that our method not only creates cross-lingual VC with high speaker similarity but also can be seamlessly used for cross-lingual TTS without having to perform any extra steps.
arXiv Detail & Related papers (2020-10-08T01:25:07Z) - Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking
Head Generation Using Phonetic Posteriorgrams [58.617181880383605]
In this work, we propose a novel approach using phonetic posteriorgrams.
Our method does not need hand-crafted features and is more robust to noise than recent approaches.
Our model is the first to support multilingual/mixlingual speech as input with convincing results.
arXiv Detail & Related papers (2020-06-20T16:32:43Z) - Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension of Tacotron2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers; a toy sketch of this style of speaker/language conditioning follows the list.
arXiv Detail & Related papers (2020-05-21T03:03:34Z)