Multilingual context-based pronunciation learning for Text-to-Speech
- URL: http://arxiv.org/abs/2307.16709v1
- Date: Mon, 31 Jul 2023 14:29:06 GMT
- Title: Multilingual context-based pronunciation learning for Text-to-Speech
- Authors: Giulia Comini, Manuel Sam Ribeiro, Fan Yang, Heereen Shim, Jaime
Lorenzo-Trueba
- Abstract summary: Phonetic information and linguistic knowledge are an essential component of a Text-to-speech (TTS) front-end.
We showcase a multilingual unified front-end system that addresses any pronunciation-related task typically handled by separate modules.
We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist when compared to equivalent monolingual solutions.
- Score: 13.941800219395757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Phonetic information and linguistic knowledge are an essential component of a
Text-to-speech (TTS) front-end. Given a language, a lexicon can be collected
offline and Grapheme-to-Phoneme (G2P) relationships are usually modeled in
order to predict the pronunciation for out-of-vocabulary (OOV) words.
Additionally, post-lexical phonology, often defined in the form of rule-based
systems, is used to correct pronunciation within or between words. In this work
we showcase a multilingual unified front-end system that addresses any
pronunciation-related task typically handled by separate modules. We evaluate
the proposed model on G2P conversion and other language-specific challenges,
such as homograph and polyphone disambiguation, post-lexical rules and
implicit diacritization. We find that the multilingual model is competitive
across languages and tasks; however, some trade-offs exist when compared to
equivalent monolingual solutions.
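The front-end pipeline described in the abstract (offline lexicon lookup, G2P fallback for out-of-vocabulary words, then post-lexical rules across word boundaries) can be sketched as follows. The lexicon, phone inventory, and the single post-lexical rule below are illustrative assumptions, not taken from the paper:

```python
# Minimal sketch of a classic TTS front-end: lexicon first,
# a G2P fallback for OOV words, then a post-lexical rule.
# All entries and rules here are toy examples.

LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
}

# Naive letter-to-phone fallback standing in for a trained G2P model.
LETTER_TO_PHONE = {"o": "AA", "k": "K", "d": "D", "g": "G"}

VOWELS = ("AA", "AE", "AH", "IY")

def g2p_fallback(word):
    return [LETTER_TO_PHONE.get(ch, ch.upper()) for ch in word]

def lookup(word):
    """Offline lexicon first; G2P only for out-of-vocabulary words."""
    return LEXICON.get(word, g2p_fallback(word))

def post_lexical(word_phones):
    """Toy cross-word rule: 'the' (DH AH) surfaces as DH IY
    before a vowel-initial word."""
    out = []
    for i, phones in enumerate(word_phones):
        nxt = word_phones[i + 1] if i + 1 < len(word_phones) else None
        if phones == ["DH", "AH"] and nxt and nxt[0] in VOWELS:
            out.append(["DH", "IY"])
        else:
            out.append(phones)
    return out

def front_end(text):
    return post_lexical([lookup(w) for w in text.lower().split()])
```

In the classic setup each of these stages is a separate module; the paper's contribution is collapsing them into one multilingual model.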
Related papers
- Wav2Gloss: Generating Interlinear Glossed Text from Speech [78.64412090339044]
We propose Wav2Gloss, a task in which four linguistic annotation components are extracted automatically from speech.
We provide various baselines to lay the groundwork for future research on Interlinear Glossed Text generation from speech.
arXiv Detail & Related papers (2024-03-19T21:45:29Z)
- MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning [0.76146285961466]
We present a methodology for linguistic feature extraction, focusing on automatically syllabifying words in multiple languages.
In both the textual and phonetic domains, our method focuses on the extraction of phonetic transcriptions from text, stress marks, and a unified automatic syllabification.
The system was built with open-source components and resources.
arXiv Detail & Related papers (2023-10-17T19:27:23Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST)
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- Improve Bilingual TTS Using Dynamic Language and Phonology Embedding [10.244215079409797]
This paper builds a Mandarin-English TTS system to acquire more standard spoken English speech from a monolingual Chinese speaker.
We specially design an embedding strength modulator to capture the dynamic strength of language and phonology.
arXiv Detail & Related papers (2022-12-07T03:46:18Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Differentiable Allophone Graphs for Language-Universal Speech Recognition [77.2981317283029]
Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages.
We present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings.
We build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language.
arXiv Detail & Related papers (2021-07-24T15:09:32Z)
- Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
arXiv Detail & Related papers (2020-08-06T18:25:18Z)
- Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion [13.543705472805431]
We present a single end-to-end trained neural G2P model that shares the same encoder and decoder across multiple languages.
We show a 7.2% average improvement in phoneme error rate on low-resource languages and no degradation on high-resource ones compared to monolingual baselines.
arXiv Detail & Related papers (2020-06-25T06:16:29Z)
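A shared encoder/decoder across languages, as in the multilingual G2P entry above, is commonly realized by marking each input with a language token. A minimal sketch of that input encoding (the `<lang>` token convention is an assumption; the paper's exact scheme may differ):

```python
def make_g2p_input(word, lang):
    """Prefix the grapheme sequence with a language token so a single
    shared seq2seq model can condition on the language. The <lang>
    token convention here is illustrative."""
    return [f"<{lang}>"] + list(word)

# The same model receives language identity as part of its input:
en = make_g2p_input("hello", "en")  # ['<en>', 'h', 'e', 'l', 'l', 'o']
de = make_g2p_input("hallo", "de")  # ['<de>', 'h', 'a', 'l', 'l', 'o']
```

Conditioning one model on a language token is what lets low-resource languages benefit from parameters trained mostly on high-resource data.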
This list is automatically generated from the titles and abstracts of the papers in this site.