DiscreTalk: Text-to-Speech as a Machine Translation Problem
- URL: http://arxiv.org/abs/2005.05525v1
- Date: Tue, 12 May 2020 02:45:09 GMT
- Title: DiscreTalk: Text-to-Speech as a Machine Translation Problem
- Authors: Tomoki Hayashi and Shinji Watanabe
- Abstract summary: This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
- Score: 52.33785857500754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on
neural machine translation (NMT). The proposed model consists of two
components: a non-autoregressive vector quantized variational autoencoder
(VQ-VAE) model and an autoregressive Transformer-NMT model. The VQ-VAE model
learns a mapping function from a speech waveform into a sequence of discrete
symbols, and the Transformer-NMT model is then trained to estimate this
discrete symbol sequence from a given input text. Since the VQ-VAE model can
learn such a mapping in a fully data-driven manner, we do not need to consider
the feature-extraction hyperparameters required by conventional E2E-TTS
models. Thanks to the use of discrete symbols, we can apply various techniques
developed for NMT and automatic speech recognition (ASR), such as beam search,
subword units, and fusion with a language model. Furthermore, we can avoid the
over-smoothing of predicted features, which is one of the common issues in
TTS. An experimental evaluation on the JSUT corpus shows that the proposed
method outperforms a conventional Transformer-TTS model with a
non-autoregressive neural vocoder in naturalness, achieving performance
comparable to the reconstruction of the VQ-VAE model.
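Concretely, the pipeline reduces TTS to "translating" text into a discrete code sequence and then decoding those codes back to a waveform. The following is a minimal PyTorch sketch of that two-stage structure; the class names, layer choices, and sizes are illustrative assumptions rather than the authors' implementation, and positional encodings and the VQ-VAE commitment/codebook losses are omitted for brevity.

```python
import torch
import torch.nn as nn

class SpeechVQVAE(nn.Module):
    """Stage 1: non-autoregressive VQ-VAE mapping waveforms to discrete symbols."""
    def __init__(self, codebook_size=512, dim=256, downsample=256):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=downsample, stride=downsample)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=downsample, stride=downsample)

    def encode(self, wav):                        # wav: (B, 1, T) samples
        z = self.encoder(wav).transpose(1, 2)     # (B, T', dim) latent frames
        # Nearest codebook entry per frame gives the discrete symbol sequence.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        return d.argmin(-1)                       # (B, T') integer codes

    def decode(self, codes):                      # codes: (B, T')
        z = self.codebook(codes).transpose(1, 2)  # (B, dim, T')
        return self.decoder(z)                    # (B, 1, T) waveform

class TextToCodeTransformer(nn.Module):
    """Stage 2: autoregressive Transformer-NMT model "translating" text to codes."""
    def __init__(self, text_vocab, code_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(text_vocab, dim)
        self.tgt_emb = nn.Embedding(code_vocab, dim)
        self.transformer = nn.Transformer(d_model=dim, batch_first=True)
        self.out = nn.Linear(dim, code_vocab)

    def forward(self, text, codes_in):            # teacher-forced training step
        mask = self.transformer.generate_square_subsequent_mask(codes_in.size(1))
        h = self.transformer(self.src_emb(text), self.tgt_emb(codes_in), tgt_mask=mask)
        return self.out(h)                        # next-code logits, (B, U, code_vocab)
```

Because the second stage is an ordinary discrete sequence-to-sequence model trained with cross-entropy, standard NMT machinery applies unchanged: beam search over code hypotheses, subword-style merging of frequent code n-grams, and shallow fusion that adds a weighted language-model log-probability to each hypothesis score.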
Related papers
- Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z)
- Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction [14.661123738628772]
We introduce a text-to-speech (TTS) framework based on a neural transducer.
We use discretized semantic tokens acquired from wav2vec 2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework while enjoying its monotonic alignment constraints (a sketch of such token extraction follows this entry).
arXiv Detail & Related papers (2023-11-06T06:13:39Z)
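The discretized semantic tokens come from wav2vec 2.0 features. A common recipe, sketched below under assumptions (torchaudio's bundled wav2vec 2.0, a 512-cluster k-means codebook, and placeholder `training_waveforms`/`utterance` variables), is to cluster frame-level features and use the cluster IDs as tokens; the authors' exact setup may differ.

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.WAV2VEC2_BASE          # pretrained wav2vec 2.0
w2v = bundle.get_model().eval()

def semantic_features(wav):                          # wav: (1, T) at bundle.sample_rate
    with torch.no_grad():
        feats, _ = w2v.extract_features(wav)         # per-layer feature list
    return feats[-1].squeeze(0)                      # (frames, dim), last layer

# Fit a codebook on features pooled from training speech (placeholder variable),
# then map any utterance to a sequence of integer tokens for the transducer.
pool = torch.cat([semantic_features(w) for w in training_waveforms])
kmeans = KMeans(n_clusters=512).fit(pool.numpy())
tokens = kmeans.predict(semantic_features(utterance).numpy())  # 1-D int array
```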
- Structured State Space Decoder for Speech Recognition and Synthesis [9.354721572095272]
A structured state space model (S4) has recently been proposed, producing promising results on various long-sequence modeling tasks.
In this study, we apply S4 as a decoder for ASR and text-to-speech tasks and compare it with the Transformer decoder.
For the ASR task, our experimental results demonstrate that the proposed model achieves a competitive word error rate (WER) of 1.88%/4.25%.
arXiv Detail & Related papers (2022-10-31T06:54:23Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, the factorized neural transducer, which factorizes blank and vocabulary prediction (a sketch of the idea follows this entry).
This factorization is expected to transfer improvements in a standalone language model to the transducer for speech recognition.
We demonstrate that the proposed factorized neural transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
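A minimal sketch of the factorization idea, with illustrative module names and sizes (the paper's joint network differs in detail): blank prediction stays tied to the acoustics, while the vocabulary branch depends only on the label history and therefore behaves like a standalone language model.

```python
import torch
import torch.nn as nn

class FactorizedJoint(nn.Module):
    """Transducer joint with separate blank and vocabulary predictors (sketch)."""
    def __init__(self, enc_dim=512, pred_dim=512, vocab_size=4000):
        super().__init__()
        self.blank_head = nn.Linear(enc_dim + pred_dim, 1)   # blank logit
        self.vocab_lm = nn.Linear(pred_dim, vocab_size)      # LM-like label scores
        self.vocab_am = nn.Linear(enc_dim, vocab_size)       # acoustic label scores

    def forward(self, enc_t, pred_u):
        # enc_t: (B, enc_dim) one encoder frame; pred_u: (B, pred_dim) label history.
        blank = self.blank_head(torch.cat([enc_t, pred_u], dim=-1))
        labels = self.vocab_am(enc_t) + self.vocab_lm(pred_u)  # log-linear combination
        return torch.cat([blank, labels], dim=-1)              # (B, 1 + vocab_size)
```

Because `vocab_lm` sees only text-side state, it can be fine-tuned on out-of-domain text alone, and its improvement carries into recognition, which is the adaptation route described above.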
- Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition [31.2558640840697]
We propose cross-modal Transformer-based neural correction models that refine the output of an automatic speech recognition (ASR) system.
Experiments on Japanese natural-language ASR tasks demonstrate that the proposed models achieve better ASR performance than conventional neural correction models.
arXiv Detail & Related papers (2021-07-04T07:58:31Z)
- Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation [88.78138830698173]
We focus on sequence-level knowledge distillation (SeqKD) from external text-based NMT models (a toy sketch of SeqKD follows this entry).
We train a bilingual E2E-ST model to predict paraphrased transcriptions as an auxiliary task with a single decoder.
arXiv Detail & Related papers (2021-04-13T19:00:51Z)
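The core of SeqKD is easy to state: decode pseudo-targets with the text-based teacher, then train the student on them with ordinary cross-entropy. The toy sketch below uses throwaway one-layer "models" and greedy decoding as a stand-in for beam search, purely to show the data flow; in the paper the teacher is an external NMT model and the student an end-to-end speech translation model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 64
teacher = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))  # toy stand-in
student = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))  # toy stand-in
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def seqkd_step(src_tokens):
    with torch.no_grad():                          # teacher decodes pseudo-targets
        pseudo = teacher(src_tokens).argmax(-1)    # greedy stand-in for beam search
    logits = student(src_tokens)                   # student trains on teacher output,
    loss = F.cross_entropy(logits.view(-1, vocab), # not on the reference targets
                           pseudo.view(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

seqkd_step(torch.randint(0, vocab, (8, 12)))       # (batch, length) toy batch
```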
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech-processing tasks for which large-scale corpora are readily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis [48.151894340550385]
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes.
We investigate under what conditions neural sequence-to-sequence TTS works well in Japanese and English.
arXiv Detail & Related papers (2020-05-20T23:26:14Z)
- Variational Transformers for Diverse Response Generation [71.53159402053392]
The Variational Transformer (VT) is a variational self-attentive feed-forward sequence model.
VT combines the parallelizability and global receptive field of the Transformer with the variational nature of the CVAE.
We explore two types of VT: 1) modeling discourse-level diversity with a global latent variable, and 2) augmenting the Transformer decoder with a sequence of fine-grained latent variables (a sketch of the global-latent variant follows this entry).
arXiv Detail & Related papers (2020-03-28T07:48:02Z)
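A minimal sketch of the first variant, a CVAE-style global latent that conditions the decoder; the module names, sizes, and exact conditioning point are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalLatentVT(nn.Module):
    """CVAE-style global latent for a Transformer decoder (sketch)."""
    def __init__(self, dim=256, latent=64):
        super().__init__()
        self.prior = nn.Linear(dim, 2 * latent)          # p(z | source)
        self.posterior = nn.Linear(2 * dim, 2 * latent)  # q(z | source, target)
        self.to_dec = nn.Linear(latent, dim)             # inject z into the decoder

    @staticmethod
    def sample(params):
        mu, logvar = params.chunk(2, dim=-1)             # reparameterization trick
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp(), mu, logvar

    def forward(self, src_ctx, tgt_ctx):
        # Training: sample z from the posterior and regularize with KL(q || p).
        z, mu_q, lv_q = self.sample(self.posterior(torch.cat([src_ctx, tgt_ctx], -1)))
        mu_p, lv_p = self.prior(src_ctx).chunk(2, dim=-1)
        kl = 0.5 * (lv_p - lv_q
                    + (lv_q.exp() + (mu_q - mu_p).pow(2)) / lv_p.exp()
                    - 1).sum(-1)
        return self.to_dec(z), kl                        # decoder bias and KL term
```

At inference time, z is sampled from the prior alone, so repeated sampling yields diverse outputs for the same source context.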