Investigation of learning abilities on linguistic features in
sequence-to-sequence text-to-speech synthesis
- URL: http://arxiv.org/abs/2005.10390v2
- Date: Wed, 7 Oct 2020 04:18:56 GMT
- Authors: Yusuke Yasuda, Xin Wang, Junichi Yamagishi
- Abstract summary: Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes.
We investigate under what conditions the neural sequence-to-sequence TTS can work well in Japanese and English.
- Score: 48.151894340550385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce
high-quality speech directly from text or simple linguistic features such as
phonemes. Unlike traditional pipeline TTS, the neural sequence-to-sequence TTS
does not require manually annotated and complicated linguistic features such as
part-of-speech tags and syntactic structures for system training. However, it
must be carefully designed and well optimized so that it can implicitly extract
useful linguistic features from the input features. In this paper we
investigate under what conditions the neural sequence-to-sequence TTS can work
well in Japanese and English along with comparisons with deep neural network
(DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline
systems also use autoregressive probabilistic modeling and a neural vocoder. We
investigated systems from three aspects: a) model architecture, b) model
parameter size, and c) language. For the model architecture aspect, we adopt
modified Tacotron systems that we previously proposed and their variants using
an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we
investigate two model parameter sizes. For the language aspect, we conduct
listening tests in both Japanese and English to see if our findings can be
generalized across languages. Our experiments suggest that a) a neural
sequence-to-sequence TTS system should have a sufficient number of model
parameters to produce high quality speech, b) it should also use a powerful
encoder when it takes characters as inputs, and c) the encoder still has room
for improvement and needs a better architecture to learn supra-segmental
features more appropriately.
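The encoder's job described in the abstract — implicitly turning character inputs into linguistically useful hidden states — can be illustrated with a minimal numpy sketch of a Tacotron2-style character encoder (embedding followed by a 1-D convolution with ReLU). All names, dimensions, and parameter values below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters (assumptions, not the paper's settings)
VOCAB = list("abcdefghijklmnopqrstuvwxyz ")
EMB_DIM = 8      # character embedding size
HID_DIM = 8      # encoder hidden size
KERNEL = 3       # convolution width

# Randomly initialized parameters (learned in a real system)
embedding = rng.normal(0, 0.1, (len(VOCAB), EMB_DIM))
conv_w = rng.normal(0, 0.1, (KERNEL, EMB_DIM, HID_DIM))

def encode(text: str) -> np.ndarray:
    """Map a character sequence to one hidden state per character.

    A seq2seq TTS encoder must learn linguistic structure
    (pronunciation, phrasing) implicitly from such states.
    """
    ids = [VOCAB.index(c) for c in text.lower() if c in VOCAB]
    x = embedding[ids]                       # (T, EMB_DIM)
    # 1-D convolution over time with "same" padding, then ReLU
    pad = KERNEL // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    h = np.stack([
        sum(xp[t + k] @ conv_w[k] for k in range(KERNEL))
        for t in range(len(ids))
    ])
    return np.maximum(h, 0.0)                # (T, HID_DIM)

states = encode("text to speech")
print(states.shape)  # one hidden vector per input character
```

In a full Tacotron2-style encoder several such convolution layers are followed by a bidirectional LSTM, which is what gives the model a chance to capture the supra-segmental (cross-character) context the paper finds lacking.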
Related papers
- On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition [31.58289343561422]
We compare five different TTS decoder architectures in the scope of synthetic data generation to show the impact on CTC-based speech recognition training.
For data generation, autoregressive decoding performs better than non-autoregressive decoding, and we propose an approach to quantify TTS generalization capabilities.
arXiv Detail & Related papers (2024-07-31T09:37:27Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- OverFlow: Putting flows on top of neural transducers for better TTS [9.346907121576258]
Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech.
In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics.
arXiv Detail & Related papers (2022-11-13T12:53:05Z)
- LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter [10.262490936452688]
This paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T.
By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets.
arXiv Detail & Related papers (2020-11-17T06:42:47Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- DiscreTalk: Text-to-Speech as a Machine Translation Problem [52.33785857500754]
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
arXiv Detail & Related papers (2020-05-12T02:45:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.