Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based
TTS
- URL: http://arxiv.org/abs/2008.05284v1
- Date: Tue, 11 Aug 2020 07:57:29 GMT
- Title: Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based
TTS
- Authors: Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao and Haizhou Li
- Abstract summary: We extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks.
We show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
- Score: 74.11899135025503
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tacotron-based end-to-end speech synthesis has shown remarkable voice
quality. However, the rendering of prosody in the synthesized speech remains to
be improved, especially for long sentences, where prosodic phrasing errors can
occur frequently. In this paper, we extend the Tacotron-based speech synthesis
framework to explicitly model the prosodic phrase breaks. We propose a
multi-task learning scheme for Tacotron training that optimizes the system to
predict both the Mel spectrum and phrase breaks. To the best of our knowledge, this is
the first implementation of multi-task learning for Tacotron-based TTS with a
prosodic phrasing model. Experiments show that our proposed training scheme
consistently improves the voice quality for both Chinese and Mongolian systems.
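The abstract describes a joint objective over the Mel spectrum and phrase-break labels. Below is a minimal PyTorch sketch of such a multi-task loss, assuming an L1 reconstruction term, a cross-entropy break-prediction term, and a hypothetical interpolation weight `break_weight` (the abstract does not give the exact weighting):

```python
import torch
import torch.nn as nn

class MultiTaskTacotronLoss(nn.Module):
    """Joint objective: Mel-spectrum reconstruction + phrase-break prediction.

    `break_weight` is a hypothetical interpolation weight; the paper's exact
    loss weighting is not stated in the abstract.
    """
    def __init__(self, break_weight: float = 0.5):
        super().__init__()
        self.mel_loss = nn.L1Loss()               # Mel reconstruction term
        self.break_loss = nn.CrossEntropyLoss()   # break / no-break per token
        self.break_weight = break_weight

    def forward(self, mel_pred, mel_target, break_logits, break_labels):
        # break_logits: (batch, seq_len, num_classes); break_labels: (batch, seq_len)
        l_mel = self.mel_loss(mel_pred, mel_target)
        l_break = self.break_loss(
            break_logits.reshape(-1, break_logits.size(-1)),
            break_labels.reshape(-1),
        )
        return l_mel + self.break_weight * l_break
```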
Related papers
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low
Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed TTS architecture is designed for multiple code generation and monotonic alignment.
We show that the proposed architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
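The generic vector-quantization step behind such code-based systems maps continuous speech frames to discrete codebook indices. A minimal sketch, assuming a single learned codebook (the paper's multi-codebook details are not given here):

```python
import torch

def vector_quantize(features: torch.Tensor, codebook: torch.Tensor):
    """Map continuous speech features to discrete codes by nearest-neighbour
    lookup in a learned codebook -- the generic VQ step, not the paper's
    exact configuration.

    features: (batch, time, dim); codebook: (num_codes, dim)
    """
    # Pairwise distances between every frame and every code vector
    expanded = codebook.unsqueeze(0).expand(features.size(0), -1, -1)
    dists = torch.cdist(features, expanded)   # (batch, time, num_codes)
    codes = dists.argmin(dim=-1)              # (batch, time) discrete indices
    quantized = codebook[codes]               # (batch, time, dim) quantized frames
    return codes, quantized
```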
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech (TTS) synthesis.
Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
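A minimal sketch of the codec-language-model idea: acoustic codec tokens are sampled autoregressively, conditioned on the phoneme sequence and a short enrolled-speaker prompt. The `model` interface, greedy decoding, and the `max_len`/`eos_id` values are illustrative assumptions, not VALL-E's actual implementation:

```python
import torch

@torch.no_grad()
def synthesize_codes(model, phonemes, prompt_codes, max_len=1500, eos_id=1024):
    """Autoregressively sample acoustic codec tokens conditioned on phonemes
    and an enrollment prompt (hypothetical interface)."""
    codes = prompt_codes  # (1, t0) codec tokens from a short enrollment clip
    for _ in range(max_len):
        logits = model(phonemes, codes)                  # (1, t, vocab)
        next_code = logits[:, -1].argmax(-1, keepdim=True)
        if next_code.item() == eos_id:                   # end-of-speech token
            break
        codes = torch.cat([codes, next_code], dim=1)
    return codes  # decode back to a waveform with the codec's decoder
```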
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
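A minimal sketch of the random-masking step described above; the 15% mask ratio and the mask values are assumptions, as the summary does not specify them:

```python
import torch

def mask_for_pretraining(spectrogram, phonemes, mask_ratio=0.15,
                         mask_value=0.0, mask_token=0):
    """Randomly mask spectrogram frames and phoneme tokens for joint
    speech-text pretraining; the model then reconstructs masked positions.

    spectrogram: (batch, time, n_mels) float; phonemes: (batch, len) long
    """
    spec = spectrogram.clone()
    phon = phonemes.clone()
    frame_mask = torch.rand(spec.shape[:2]) < mask_ratio   # (batch, time)
    token_mask = torch.rand(phon.shape) < mask_ratio       # (batch, len)
    spec[frame_mask] = mask_value   # zero out whole masked frames
    phon[token_mask] = mask_token   # replace with a [MASK] id
    return spec, phon, frame_mask, token_mask
```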
- Textless Direct Speech-to-Speech Translation with Discrete Speech
Representation [27.182170555234226]
We propose a novel model, Textless Translatotron, for training an end-to-end direct S2ST model without any textual supervision.
When a speech encoder pre-trained on unsupervised speech data is used for both models, the proposed model achieves translation quality nearly on par with Translatotron 2.
arXiv Detail & Related papers (2022-10-31T19:48:38Z)
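One common recipe for discrete speech representations in textless S2ST is to cluster frames from a pre-trained speech encoder into unit ids. A sketch under that assumption; the encoder choice and the 1000-unit vocabulary are illustrative, not necessarily this paper's configuration:

```python
import numpy as np
from sklearn.cluster import KMeans

def speech_to_units(encoder, waveforms, n_units=1000):
    """Turn speech into discrete 'pseudo-text' units by clustering frames of
    a pre-trained speech encoder. `encoder` is assumed to map a waveform to
    a (time, dim) numpy array of features.
    """
    feats = [encoder(w) for w in waveforms]            # each: (time, dim)
    kmeans = KMeans(n_clusters=n_units, n_init=10).fit(np.vstack(feats))
    return [kmeans.predict(f) for f in feats]          # per-utterance unit ids
```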
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, achieving a speedup of up to 21.4x over the autoregressive baseline.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
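A minimal sketch of mask-predict decoding, the standard non-autoregressive technique the summary alludes to: start from a fully masked unit sequence, then iteratively re-mask and re-predict the least confident positions. The `model` interface and the linear masking schedule are assumptions:

```python
import torch

@torch.no_grad()
def mask_predict(model, source, length, iterations=10, mask_id=0):
    """CMLM-style iterative mask-predict decoding over discrete units
    (hypothetical model interface)."""
    units = torch.full((1, length), mask_id, dtype=torch.long)
    probs = torch.zeros(1, length)
    for t in range(iterations):
        logits = model(source, units)                 # (1, length, vocab)
        new_probs, new_units = logits.softmax(-1).max(-1)
        masked = units.eq(mask_id)
        units[masked] = new_units[masked]             # fill masked slots
        probs[masked] = new_probs[masked]
        n_mask = int(length * (iterations - t - 1) / iterations)
        if n_mask == 0:
            break
        worst = probs.topk(n_mask, largest=False).indices
        units[0, worst] = mask_id                     # re-mask low-confidence positions
        probs[0, worst] = 0.0
    return units
```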
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis
Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS) synthesis.
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We find that the model benefits from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
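A hypothetical sketch of one semi-supervised training step under the stated idea: a supervised text-to-mel loss on paired data plus a reconstruction loss on untranscribed audio routed through the discrete speech representation. All method names on `tts` and the weight `alpha` are illustrative, not the paper's API:

```python
def train_step(tts, paired_batch, unpaired_audio, optimizer, alpha=1.0):
    """One semi-supervised step combining paired and unpaired objectives
    (all `tts` methods are hypothetical placeholders)."""
    text, mel = paired_batch
    sup_loss = tts.supervised_loss(text, mel)   # usual text-to-mel objective
    # Untranscribed audio: encode to discrete codes, decode, and reconstruct
    codes = tts.quantize(tts.encode_audio(unpaired_audio))
    unsup_loss = tts.reconstruction_loss(tts.decode(codes), unpaired_audio)
    loss = sup_loss + alpha * unsup_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```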
- Scalable Multilingual Frontend for TTS [4.1203601403593275]
This paper describes progress towards making a Neural Text-to-Speech (TTS) Frontend that works for many languages and can be easily extended to new languages.
We take a Machine Translation-inspired approach to constructing the frontend, modeling both text normalization and pronunciation at the sentence level with sequence-to-sequence (S2S) models.
For our language-independent approach to pronunciation we do not use a lexicon; instead, all pronunciations, including context-based pronunciations, are captured by the S2S model.
arXiv Detail & Related papers (2020-04-10T08:00:40Z)
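To make the sentence-level S2S formulation concrete, here is a hypothetical pair of training examples in which raw text maps directly to a normalized phoneme string, so context-dependent pronunciations are learned without a lexicon. The ARPAbet-style phones are illustrative, not the paper's actual representation:

```python
# Hypothetical (text, phoneme-string) pairs for a sentence-level S2S frontend.
# Normalization ("Dr.", "10", "St.") and pronunciation are handled jointly,
# and context resolves ambiguity ("read" here is past tense).
pairs = [
    ("Dr. Smith lives at 10 Main St.",
     "D AA1 K T ER0 S M IH1 TH L IH1 V Z AE1 T T EH1 N M EY1 N S T R IY1 T"),
    ("I read the book yesterday.",
     "AY1 R EH1 D DH AH0 B UH1 K Y EH1 S T ER0 D EY2"),
]
# Any generic sequence-to-sequence model (e.g., a Transformer over characters
# in and phones out) can be trained on such pairs.
```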