A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural
Machine Translation
- URL: http://arxiv.org/abs/2206.04922v1
- Date: Fri, 10 Jun 2022 07:46:34 GMT
- Title: A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural
Machine Translation
- Authors: Wudi Bao, Junhui Zhang, Junjie Pan, Xiang Yin
- Abstract summary: We propose a novel Chinese dialect TTS frontend with a translation module.
It helps to convert Mandarin text into idiomatic expressions with correct orthography and grammar.
It is the first known work to incorporate translation into a TTS frontend.
- Score: 6.090922774386845
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Chinese dialect text-to-speech (TTS) systems can usually be used
only by native linguists, because the written form of Chinese dialects differs
from Mandarin in characters, idioms, grammar and usage, and even local speakers
cannot always input a correct sentence. With Mandarin text inputs, a Chinese
dialect TTS system can only generate partly meaningful speech with relatively
poor prosody and naturalness. To lower the bar of use and make the system more
practical for commercial applications, we propose a novel Chinese dialect TTS
frontend with a translation module, which converts Mandarin text into idiomatic
expressions with correct orthography and grammar so that the intelligibility
and naturalness of the synthesized speech are improved. A non-autoregressive
neural machine translation model with a glancing sampling strategy is proposed
for the translation task. To our knowledge, this is the first work to
incorporate translation into a TTS frontend. Experiments on Cantonese show that
the proposed frontend helps a Cantonese TTS system achieve a 0.27 MOS
improvement with Mandarin inputs.
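The key component named in the abstract is a non-autoregressive translation model trained with a glancing sampling strategy: all target positions are predicted in parallel, and during training the model "glances" at part of the reference sentence and learns to fill in the rest. The sketch below is a minimal PyTorch illustration of that training idea under stated assumptions; the module sizes, the mask-token convention, and the fixed glance ratio are placeholders, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonAutoregressiveTranslator(nn.Module):
    """Parallel seq2seq: every target token is predicted at once (no causal mask)."""
    def __init__(self, vocab_size, d_model=256, mask_id=0):
        super().__init__()
        self.mask_id = mask_id
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=3)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=3)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt_in):
        memory = self.encoder(self.embed(src))             # positional encodings omitted for brevity
        hidden = self.decoder(self.embed(tgt_in), memory)  # fully parallel decoding
        return self.proj(hidden)                           # (batch, tgt_len, vocab)

def glancing_training_step(model, src, tgt, glance_ratio=0.5):
    """One glancing-sampling step: reveal part of the reference, learn to predict the rest."""
    masked = torch.full_like(tgt, model.mask_id)
    with torch.no_grad():                                  # 1) first pass on a fully masked target
        first_pred = model(src, masked).argmax(-1)

    n_wrong = (first_pred != tgt).sum(-1, keepdim=True)    # 2) reveal more tokens when the model is worse
    n_glance = (glance_ratio * n_wrong).long()
    rand_perm = torch.rand_like(tgt, dtype=torch.float).argsort(-1)
    reveal = rand_perm < n_glance                          # random subset of n_glance positions per sentence
    glanced = torch.where(reveal, tgt, masked)

    logits = model(src, glanced)                           # 3) second pass, loss only on still-masked positions
    return F.cross_entropy(logits[~reveal], tgt[~reveal])
```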
Related papers
- Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation [3.9166923630129604]
Bailing-TTS is a family of large-scale TTS models capable of generating high-quality Chinese dialectal speech.
The Chinese dialectal representation learning is developed using a specific transformer architecture and multi-stage training processes.
Experiments demonstrate that Bailing-TTS generates Chinese dialectal speech towards human-like spontaneous representation.
arXiv Detail & Related papers (2024-08-01T04:57:31Z) - Crossing the Threshold: Idiomatic Machine Translation through Retrieval
Augmentation and Loss Weighting [66.02718577386426]
We provide a simple characterization of idiomatic translation and related issues.
We conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations.
To improve translation of natural idioms, we introduce two straightforward yet effective techniques.
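Of the two techniques mentioned, loss weighting lends itself to a compact illustration: tokens that fall inside an idiomatic span contribute more to the training loss. This is a minimal sketch assuming a precomputed boolean `idiom_mask` and an illustrative weight of 2.0; it is not the paper's exact formulation, and the retrieval-augmentation side is not shown.

```python
import torch
import torch.nn.functional as F

def idiom_weighted_loss(logits, targets, idiom_mask, idiom_weight=2.0, pad_id=0):
    """logits: (B, T, V); targets: (B, T); idiom_mask: (B, T) bool, True inside idiom spans."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, ignore_index=pad_id, reduction="none")  # (B, T)
    weights = idiom_mask.float() * (idiom_weight - 1.0) + 1.0   # idiom tokens weighted up
    valid = (targets != pad_id).float()
    return (per_token * weights * valid).sum() / valid.sum()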
arXiv Detail & Related papers (2023-10-10T23:47:25Z) - Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
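"Treating speech units as pseudo-text" typically means quantizing continuous speech features into discrete unit IDs and handling the resulting sequence like tokens. The sketch below illustrates that preprocessing step under assumptions (random features stand in for self-supervised speech representations, and the 200-cluster codebook size is arbitrary); it is not the UTUT pipeline itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from itertools import groupby

def build_unit_codebook(feature_matrix, n_units=200, seed=0):
    """feature_matrix: (n_frames, feat_dim) speech features pooled over a corpus."""
    return KMeans(n_clusters=n_units, random_state=seed, n_init=10).fit(feature_matrix)

def speech_to_pseudo_text(codebook, utterance_features):
    """Map one utterance's frames to unit IDs and collapse consecutive repeats."""
    unit_ids = codebook.predict(utterance_features)        # one discrete unit per frame
    return [u for u, _ in groupby(unit_ids.tolist())]      # deduplicated runs act like tokens

# Example with random stand-in features.
corpus = np.random.randn(5000, 64).astype(np.float32)
codebook = build_unit_codebook(corpus)
units = speech_to_pseudo_text(codebook, np.random.randn(300, 64).astype(np.float32))
```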
arXiv Detail & Related papers (2023-08-03T15:47:04Z) - Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with
Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z) - Improve Bilingual TTS Using Dynamic Language and Phonology Embedding [10.244215079409797]
This paper builds a Mandarin-English TTS system to acquire more standard spoken English speech from a monolingual Chinese speaker.
We specially design an embedding strength modulator to capture the dynamic strength of language and phonology.
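One way to picture an "embedding strength modulator" is a small gate that predicts, per input token, how strongly the language and phonology embeddings are mixed into the encoder states. The sketch below is speculative: the gate architecture and where it plugs into the TTS encoder are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class EmbeddingStrengthModulator(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, phoneme_states, lang_embedding, phonology_embedding):
        # phoneme_states: (B, T, d); lang/phonology embeddings: (B, d) utterance-level vectors
        strength = self.gate(phoneme_states)                     # (B, T, d), values in [0, 1]
        extra = lang_embedding.unsqueeze(1) + phonology_embedding.unsqueeze(1)
        return phoneme_states + strength * extra                 # dynamically scaled mix
```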
arXiv Detail & Related papers (2022-12-07T03:46:18Z) - ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
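The joint masking described above can be sketched as hiding random spectrogram frames and a random subset of phoneme tokens, with the pretraining loss reconstructing what was hidden. The mask ratios and zero-filling below are illustrative assumptions, not ERNIE-SAT's exact masking scheme.

```python
import torch

def mask_for_joint_pretraining(mel, phonemes, mask_phone_id,
                               frame_mask_ratio=0.3, phone_mask_ratio=0.15):
    """mel: (B, T_frames, n_mels) float; phonemes: (B, T_phones) int."""
    B, T, _ = mel.shape
    frame_mask = torch.rand(B, T, device=mel.device) < frame_mask_ratio   # which frames to hide
    masked_mel = mel.masked_fill(frame_mask.unsqueeze(-1), 0.0)           # zero out masked frames

    phone_mask = torch.rand_like(phonemes, dtype=torch.float) < phone_mask_ratio
    masked_phonemes = phonemes.masked_fill(phone_mask, mask_phone_id)

    # A pretraining objective would reconstruct mel at frame_mask and predict phonemes at phone_mask.
    return masked_mel, masked_phonemes, frame_mask, phone_mask
```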
arXiv Detail & Related papers (2022-11-07T13:35:16Z) - A Study of Modeling Rising Intonation in Cantonese Neural Speech
Synthesis [10.747119651974947]
Declarative questions are commonly used in daily Cantonese conversations.
Vanilla neural text-to-speech (TTS) systems are not capable of synthesizing rising intonation for these sentences.
We propose to complement the Cantonese TTS model with a BERT-based statement/question classifier.
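A rough sketch of that pipeline: a BERT-based classifier decides whether an input sentence is a declarative question, and the TTS frontend passes the resulting flag along so the acoustic model can render rising intonation. The checkpoint name, the binary label convention, and the flag plumbing are assumptions; the classifier would first need fine-tuning on statement/question labels.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
classifier = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

def is_question(text: str) -> bool:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = classifier(**inputs).logits      # (1, 2): assumed [statement, question]
    return bool(logits.argmax(-1).item() == 1)

def frontend_features(text: str) -> dict:
    # The question flag becomes one more input feature for the Cantonese acoustic model.
    return {"text": text, "rising_intonation": is_question(text)}
```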
arXiv Detail & Related papers (2022-08-03T16:21:08Z) - SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language
Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose three kinds of tokenizers, including: 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
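A toy illustration of the two tokenizer families: a pronunciation-based (SHUOWEN-style) tokenizer maps each character to its romanized pronunciation, while a glyph-based (JIEZI-style) tokenizer decomposes characters into sub-character components. The tiny lookup tables below are placeholders for illustration only; the real tokenizers are built from full pronunciation and glyph resources.

```python
CHAR_TO_PINYIN = {"你": "ni3", "好": "hao3"}                     # hypothetical pronunciation table
CHAR_TO_COMPONENTS = {"你": ["亻", "尔"], "好": ["女", "子"]}     # hypothetical glyph table

def pronunciation_tokenize(text: str) -> list[str]:
    return [CHAR_TO_PINYIN.get(ch, ch) for ch in text]

def glyph_tokenize(text: str) -> list[str]:
    return [part for ch in text for part in CHAR_TO_COMPONENTS.get(ch, [ch])]

print(pronunciation_tokenize("你好"))   # ['ni3', 'hao3']
print(glyph_tokenize("你好"))           # ['亻', '尔', '女', '子']
```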
arXiv Detail & Related papers (2021-06-01T11:20:02Z) - Towards Natural Bilingual and Code-Switched Speech Synthesis Based on
Mix of Monolingual Recordings and Cross-Lingual Voice Conversion [28.830575877307176]
It is not easy to obtain a bilingual corpus from a speaker who achieves native-level fluency in both languages.
A Tacotron2-based cross-lingual voice conversion system is employed to generate the Mandarin speaker's English speech and the English speaker's Mandarin speech.
The obtained bilingual data are then augmented with code-switched utterances synthesized using a Transformer model.
arXiv Detail & Related papers (2020-10-16T03:51:00Z) - Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based
TTS [74.11899135025503]
We extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks.
We show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
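The multi-task idea can be pictured as an auxiliary head on the TTS encoder that predicts a phrase-break label per input token, with its loss added to the synthesis loss. The head size, the L1 reconstruction term, and the loss weight below are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseBreakHead(nn.Module):
    def __init__(self, d_encoder, n_break_classes=2):
        super().__init__()
        self.classifier = nn.Linear(d_encoder, n_break_classes)

    def forward(self, encoder_states):              # (B, T, d_encoder) shared Tacotron encoder states
        return self.classifier(encoder_states)      # (B, T, n_break_classes)

def multitask_loss(mel_pred, mel_target, break_logits, break_labels, break_weight=0.5):
    synthesis = F.l1_loss(mel_pred, mel_target)                        # usual spectrogram loss term
    breaks = F.cross_entropy(break_logits.transpose(1, 2), break_labels)
    return synthesis + break_weight * breaks
```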
arXiv Detail & Related papers (2020-08-11T07:57:29Z) - g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin
Chinese Based on a New Open Benchmark Dataset [14.323478990713477]
We introduce a new benchmark dataset that consists of 99,000+ sentences for Chinese polyphone disambiguation.
We train a simple neural network model on it, and find that it outperforms other preexisting G2P systems.
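Polyphone disambiguation can be framed as classification: given a sentence and the position of a polyphonic character, a small network looks at the surrounding context and picks one of that character's candidate pronunciations. The architecture and sizes below are an illustrative sketch, not the released g2pM model.

```python
import torch
import torch.nn as nn

class PolyphoneClassifier(nn.Module):
    def __init__(self, vocab_size, n_pronunciations, d_emb=128, d_hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.encoder = nn.LSTM(d_emb, d_hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * d_hidden, n_pronunciations)

    def forward(self, char_ids, target_pos):
        # char_ids: (B, T) sentence as character IDs; target_pos: (B,) index of the polyphone
        states, _ = self.encoder(self.embed(char_ids))                 # (B, T, 2*d_hidden)
        target_state = states[torch.arange(char_ids.size(0)), target_pos]
        return self.out(target_state)                                  # scores over candidate pinyins
```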
arXiv Detail & Related papers (2020-04-07T05:44:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the generated summaries (including all information) and is not responsible for any consequences of their use.