Scalable Multilingual Frontend for TTS
- URL: http://arxiv.org/abs/2004.04934v1
- Date: Fri, 10 Apr 2020 08:00:40 GMT
- Title: Scalable Multilingual Frontend for TTS
- Authors: Alistair Conkie, Andrew Finch
- Abstract summary: This paper describes progress towards making a Neural Text-to-Speech (TTS) Frontend that works for many languages and can be easily extended to new languages.
We take a Machine Translation inspired approach to constructing the frontend, and model both text normalization and pronunciation on a sentence level by building and using sequence-to-sequence (S2S) models.
For our language-independent approach to pronunciation we do not use a lexicon. Instead, all pronunciations, including context-based pronunciations, are captured in the S2S model.
- Score: 4.1203601403593275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes progress towards making a Neural Text-to-Speech (TTS)
Frontend that works for many languages and can be easily extended to new
languages. We take a Machine Translation (MT) inspired approach to constructing
the frontend, and model both text normalization and pronunciation on a sentence
level by building and using sequence-to-sequence (S2S) models. We experimented
with training normalization and pronunciation as separate S2S models and with
training a single S2S model combining both functions.
For our language-independent approach to pronunciation we do not use a
lexicon. Instead, all pronunciations, including context-based pronunciations,
are captured in the S2S model. We also present a language-independent chunking
and splicing technique that allows us to process arbitrary-length sentences.
Models for 18 languages were trained and evaluated. Many of the accuracy
measurements are above 99%. We also evaluated the models in the context of
end-to-end synthesis against our current production system.
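
The chunking-and-splicing idea lends itself to a short illustration. The paper does not publish code, so the sketch below is a hypothetical reconstruction under one stated assumption: the S2S model's output is token-aligned with its input (plausible for normalization and pronunciation, unlike general MT), so overlapping chunks can be processed independently and their outputs spliced by discarding the overlap. All names and parameters are illustrative.

```python
# Hypothetical sketch of language-independent chunking and splicing;
# names and parameters are illustrative, not taken from the paper.
from typing import Callable, List

def chunk_and_splice(
    tokens: List[str],
    s2s: Callable[[List[str]], List[str]],
    max_len: int = 32,
    overlap: int = 4,
) -> List[str]:
    """Run a bounded-length S2S model over overlapping chunks of an
    arbitrary-length sentence and splice the outputs back together."""
    if len(tokens) <= max_len:
        return s2s(tokens)
    step = max_len - overlap
    spliced: List[str] = []
    for start in range(0, len(tokens), step):
        out = s2s(tokens[start:start + max_len])
        # Assumption: output tokens align one-to-one with input tokens,
        # so the first `overlap` outputs duplicate the previous chunk.
        spliced.extend(out if start == 0 else out[overlap:])
        if start + max_len >= len(tokens):
            break
    return spliced

if __name__ == "__main__":
    def toy_s2s(chunk: List[str]) -> List[str]:
        # Stand-in for a trained normalization + pronunciation model.
        return [t.upper() for t in chunk]

    words = ("this sentence is long enough to require more than one "
             "chunk when the maximum chunk length is small").split()
    print(chunk_and_splice(words, toy_s2s, max_len=8, overlap=2))
```

In the paper the splice points are chosen language-independently; here a fixed token overlap stands in for that mechanism.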
Related papers
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) synthesis.
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Enhancing Speech-to-Speech Translation with Multiple TTS Targets [62.18395387305803]
We analyze the effect of changing synthesized target speech for direct S2ST models.
We propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems.
arXiv Detail & Related papers (2023-04-10T14:33:33Z)
- Textless Direct Speech-to-Speech Translation with Discrete Speech Representation [27.182170555234226]
We propose a novel model, Textless Translatotron, for training an end-to-end direct S2ST model without any textual supervision.
When a speech encoder pre-trained with unsupervised speech data is used for both models, the proposed model obtains translation quality nearly on par with Translatotron 2.
arXiv Detail & Related papers (2022-10-31T19:48:38Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model that uses an online dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS [74.11899135025503]
We extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks.
We show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
arXiv Detail & Related papers (2020-08-11T07:57:29Z)
- One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech [3.42658286826597]
We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation.
Our model is shown to effectively share information across languages, and according to a subjective evaluation test it produces more natural and accurate code-switching speech than the baselines.
arXiv Detail & Related papers (2020-08-03T10:43:30Z)
- Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion [13.543705472805431]
We present a single end-to-end trained neural G2P model that shares the same encoder and decoder across multiple languages.
We show a 7.2% average improvement in phoneme error rate on low-resource languages and no degradation on high-resource ones compared to monolingual baselines.
arXiv Detail & Related papers (2020-06-25T06:16:29Z)
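
The last entry describes a single shared encoder/decoder serving many languages. A common way to realize this, sketched below as an assumption rather than a detail taken from that abstract, is to prepend a language-ID token to the grapheme sequence; the token format and the `run_s2s` callable are hypothetical.

```python
# Hypothetical illustration of multilingual G2P with a shared seq2seq
# model: a language-ID token is prepended to the graphemes so one
# encoder/decoder can serve many languages. Names are illustrative.
from typing import Callable, List

def g2p_source(word: str, lang: str) -> List[str]:
    """Build the source sequence: language token, then graphemes."""
    return [f"<{lang}>"] + list(word)

def phonemize(word: str, lang: str,
              run_s2s: Callable[[List[str]], List[str]]) -> str:
    """Map a word to a phoneme string via the shared S2S model."""
    return " ".join(run_s2s(g2p_source(word, lang)))

if __name__ == "__main__":
    def toy_model(src: List[str]) -> List[str]:
        # Stand-in for a trained model: echoes graphemes as "phonemes".
        return src[1:]

    print(g2p_source("chat", "fr"))            # ['<fr>', 'c', 'h', 'a', 't']
    print(phonemize("chat", "fr", toy_model))  # c h a t
```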