Improve Bilingual TTS Using Dynamic Language and Phonology Embedding
- URL: http://arxiv.org/abs/2212.03435v1
- Date: Wed, 7 Dec 2022 03:46:18 GMT
- Title: Improve Bilingual TTS Using Dynamic Language and Phonology Embedding
- Authors: Fengyu Yang, Jian Luan, Yujun Wang
- Abstract summary: This paper builds a Mandarin-English TTS system to acquire more standard spoken English speech from a monolingual Chinese speaker.
We design a dedicated embedding strength modulator to capture the dynamic strength of language and phonology.
- Score: 10.244215079409797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In most cases, bilingual TTS needs to handle three types of input scripts:
first language only, second language only, and second language embedded in the
first language. In the latter two situations, the pronunciation and intonation
of the second language are usually quite different due to the influence of the
first language. Therefore, it is a big challenge to accurately model the
pronunciation and intonation of the second language in different contexts
without mutual interference. This paper builds a Mandarin-English TTS system to
acquire more standard spoken English speech from a monolingual Chinese speaker.
We introduce a phonology embedding to capture how English is realized under
different phonologies. An embedding mask is applied to the language embedding
to keep information from the two languages apart, and to the phonology
embedding to focus it on English expression. We further design an embedding
strength modulator to capture the dynamic strength of language and phonology.
Experiments show that our approach produces significantly more natural and
standard spoken English from the monolingual Chinese speaker. Analysis shows
that suitable phonology control contributes to better performance across
scenarios.
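For intuition, here is a minimal PyTorch sketch of the mechanism the abstract describes: an auxiliary language or phonology embedding added to the phoneme encoding, gated by an embedding mask and a learned per-token strength. Module names, dimensions, and the gating formulation are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class DynamicEmbedding(nn.Module):
    """Sketch: add a language (or phonology) embedding to the phoneme
    encoding, scaled by a learned per-token strength and an embedding
    mask that zeroes out tokens the embedding should not affect."""

    def __init__(self, num_classes: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.table = nn.Embedding(num_classes, embed_dim)
        self.proj = nn.Linear(embed_dim, hidden_dim)
        # Embedding strength modulator (assumed form): predicts a scalar
        # in (0, 1) per token from the encoder state and the embedding.
        self.modulator = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, enc, ids, mask):
        # enc:  (B, T, H) phoneme encoder output
        # ids:  (B, T)    language or phonology class per token
        # mask: (B, T)    1 where the embedding applies, else 0
        emb = self.proj(self.table(ids))                        # (B, T, H)
        strength = self.modulator(torch.cat([enc, emb], -1))    # (B, T, 1)
        strength = strength * mask.unsqueeze(-1)
        return enc + strength * emb

# Usage sketch: a language embedding over {Mandarin, English} and a
# phonology embedding that the mask restricts to English tokens only.
enc = torch.randn(2, 50, 256)
lang_ids = torch.randint(0, 2, (2, 50))      # 0 = Mandarin, 1 = English
lang_mask = torch.ones(2, 50)                # language info on every token
eng_mask = (lang_ids == 1).float()           # phonology only on English

lang_layer = DynamicEmbedding(2, 64, 256)
phon_layer = DynamicEmbedding(2, 64, 256)
out = phon_layer(lang_layer(enc, lang_ids, lang_mask), lang_ids, eng_mask)
```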
Related papers
- What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST)
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
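The textless recipe above can be pictured as ordinary sequence-to-sequence training once speech on both sides is quantized into discrete unit IDs. A toy sketch follows; the unit vocabulary, model size, and random data are placeholders, not the UTUT implementation.

```python
import torch
import torch.nn as nn

# Pseudo-text: source and target speech are each quantized into discrete
# unit IDs (e.g. by clustering self-supervised features); translation then
# becomes ordinary sequence-to-sequence training over those IDs.
NUM_UNITS = 1000  # size of the discrete unit vocabulary (assumed)

model = nn.Transformer(d_model=256, batch_first=True)
src_embed = nn.Embedding(NUM_UNITS, 256)
tgt_embed = nn.Embedding(NUM_UNITS, 256)
out_proj = nn.Linear(256, NUM_UNITS)

src_units = torch.randint(0, NUM_UNITS, (4, 120))  # source-language units
tgt_units = torch.randint(0, NUM_UNITS, (4, 110))  # target-language units

hidden = model(src_embed(src_units), tgt_embed(tgt_units[:, :-1]))
loss = nn.functional.cross_entropy(
    out_proj(hidden).transpose(1, 2),  # (B, V, T) layout for cross_entropy
    tgt_units[:, 1:],                  # next-unit targets, shifted by one
)
loss.backward()
```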
- Multilingual context-based pronunciation learning for Text-to-Speech [13.941800219395757]
Phonetic information and linguistic knowledge are essential components of a text-to-speech (TTS) front-end.
We showcase a multilingual unified front-end system that addresses any pronunciation-related task typically handled by separate modules.
We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist compared to equivalent monolingual solutions.
arXiv Detail & Related papers (2023-07-31T14:29:06Z)
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
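The prompting idea can be sketched as a decoder-only language model over a flat token stream of source phonemes, prompt codec tokens, and target phonemes, generating target codec tokens autoregressively. Everything below (vocabulary layout, sizes, generation length) is illustrative, not VALL-E X's actual architecture.

```python
import torch
import torch.nn as nn

# Conceptual sketch: a causal LM over discrete tokens is conditioned on
# (source phonemes, codec tokens of one prompt utterance, target phonemes)
# and generates target codec tokens that keep the prompt speaker's voice.
VOCAB = 2048  # flat shared vocabulary (assumed for simplicity)
lm = nn.TransformerEncoder(  # self-attention stack used causally via a mask
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
embed = nn.Embedding(VOCAB, 256)
head = nn.Linear(256, VOCAB)

src_phones = torch.randint(0, VOCAB, (1, 30))
src_codec = torch.randint(0, VOCAB, (1, 150))   # the single prompt utterance
tgt_phones = torch.randint(0, VOCAB, (1, 40))

seq = torch.cat([src_phones, src_codec, tgt_phones], dim=1)
for _ in range(5):  # autoregressive generation of target codec tokens
    T = seq.size(1)
    causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    h = lm(embed(seq), mask=causal)
    nxt = head(h[:, -1]).argmax(-1, keepdim=True)
    seq = torch.cat([seq, nxt], dim=1)
```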
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
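A generic masked-reconstruction setup along these lines might look as follows; the mask ratio and mask values are assumptions, not ERNIE-SAT's exact recipe.

```python
import torch

torch.manual_seed(0)
mel = torch.randn(4, 200, 80)              # spectrogram frames (B, T, n_mels)
phonemes = torch.randint(1, 100, (4, 60))  # phoneme IDs; reserve 0 for [MASK]

# Randomly mask ~15% of spectrogram frames and phoneme tokens.
mel_mask = torch.rand(4, 200) < 0.15       # True = masked frame
ph_mask = torch.rand(4, 60) < 0.15         # True = masked phoneme

masked_mel = mel.masked_fill(mel_mask.unsqueeze(-1), 0.0)
masked_ph = phonemes.masked_fill(ph_mask, 0)

# A joint speech-text encoder is then trained to reconstruct the original
# frames and phoneme IDs at the masked positions only.
```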
- Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models [23.62852626011989]
We show that grammatical structures in higher-resource languages bleed into lower-resource languages.
We show this bias via a novel method for comparing the fluency of multilingual models to the fluency of monolingual Spanish and Greek models.
arXiv Detail & Related papers (2022-10-11T17:06:38Z)
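One simple stand-in for such a fluency comparison is perplexity under causal language models; the paper evaluates masked multilingual models with its own method, so this is only an analogous sketch, and the model names are placeholders.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, text: str) -> float:
    """Perplexity of `text` under a causal LM; lower = more fluent."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

text = "La niña vio al niño con el telescopio."
# A higher multilingual perplexity on natural Spanish than the monolingual
# model's would suggest an English "accent". Substitute real checkpoints.
print(perplexity("my-multilingual-lm", text))    # placeholder name
print(perplexity("my-monolingual-es-lm", text))  # placeholder name
```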
- Cross-lingual Low Resource Speaker Adaptation Using Phonological Features [2.8080708404213373]
We train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages.
With as few as 32 and 8 utterances of target-speaker data, we obtain high speaker similarity scores and naturalness comparable to results reported in the literature.
arXiv Detail & Related papers (2021-11-17T12:33:42Z)
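The conditioning idea can be sketched by replacing phoneme IDs with language-independent feature vectors. The toy feature inventory below is illustrative only; real systems use much richer phonologically derived features.

```python
import torch
import torch.nn as nn

# Toy phoneme -> phonological feature vectors (voiced, nasal, vowel, labial);
# the same feature space is shared by phonemes of every language.
FEATURES = {
    "p": [0, 0, 0, 1],
    "b": [1, 0, 0, 1],
    "m": [1, 1, 0, 1],
    "a": [1, 0, 1, 0],
    "i": [1, 0, 1, 0],
}

def featurize(phonemes):
    return torch.tensor([FEATURES[p] for p in phonemes], dtype=torch.float)

# Conditioning a language-agnostic encoder on feature vectors instead of
# phoneme IDs lets unseen languages reuse the same input space.
feat_proj = nn.Linear(4, 256)
x = feat_proj(featurize(["m", "a", "m", "i"]))  # (4, 256) encoder input
```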
- Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z)
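The summary does not spell out the training objective, but a standard teacher-student setup uses a temperature-softened KL divergence between teacher and student outputs, for example:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence between temperature-softened teacher and student
    distributions: the usual knowledge-distillation objective."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

# E.g. a small speech-to-intent student mimicking a large pre-trained
# multilingual NLP teacher over the same intent classes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```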
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
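The clustering step can be approximated in a few lines of scikit-learn; the language vectors here are random stand-ins for representations extracted from a multilingual pre-trained model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each language gets one representation vector, e.g. the mean hidden state
# of a multilingual pre-trained model over text in that language.
languages = ["en", "de", "nl", "es", "it", "pt", "ru", "pl"]
reps = np.random.default_rng(0).normal(size=(len(languages), 768))

# Group languages into "representation sprachbunds"; one model can then
# be pre-trained per cluster of similar languages.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reps)
for lang, cluster in zip(languages, labels):
    print(lang, "-> sprachbund", cluster)
```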
- Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion [28.830575877307176]
It is not easy to obtain a bilingual corpus from a speaker who achieves native-level fluency in both languages.
A Tacotron2-based cross-lingual voice conversion system is employed to generate the Mandarin speaker's English speech and the English speaker's Mandarin speech.
The obtained bilingual data are then augmented with code-switched utterances synthesized using a Transformer model.
arXiv Detail & Related papers (2020-10-16T03:51:00Z)
- Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension on Tacotron2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers.
arXiv Detail & Related papers (2020-05-21T03:03:34Z)