DiscreTalk: Text-to-Speech as a Machine Translation Problem
- URL: http://arxiv.org/abs/2005.05525v1
- Date: Tue, 12 May 2020 02:45:09 GMT
- Title: DiscreTalk: Text-to-Speech as a Machine Translation Problem
- Authors: Tomoki Hayashi and Shinji Watanabe
- Abstract summary: This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT).
The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model.
- Score: 52.33785857500754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on
neural machine translation (NMT). The proposed model consists of two
components: a non-autoregressive vector quantized variational autoencoder
(VQ-VAE) model and an autoregressive Transformer-NMT model. The VQ-VAE model
learns a mapping function from a speech waveform into a sequence of discrete
symbols, and the Transformer-NMT model is then trained to estimate this
discrete symbol sequence from a given input text. Since the VQ-VAE model can
learn such a mapping in a fully data-driven manner, we do not need to consider
the feature-extraction hyperparameters required by conventional E2E-TTS
models. Thanks to the use of discrete symbols, we can apply various techniques
developed for NMT and automatic speech recognition (ASR), such as beam search,
subword units, and fusion with a language model. Furthermore, we can avoid the
over-smoothing of predicted features, which is one of the common issues in
TTS. An experimental evaluation on the JSUT corpus shows that the proposed
method outperforms a conventional Transformer-TTS model with a
non-autoregressive neural vocoder in naturalness, achieving performance
comparable to the reconstruction of the VQ-VAE model.
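Concretely, the pipeline reduces TTS to "translating" text into a discrete code sequence and then decoding those codes back to a waveform. The following is a minimal PyTorch sketch of that two-stage structure; the class names, layer choices, and sizes are illustrative assumptions rather than the authors' implementation, and positional encodings and the VQ-VAE commitment/codebook losses are omitted for brevity.

```python
import torch
import torch.nn as nn

class SpeechVQVAE(nn.Module):
    """Stage 1: non-autoregressive VQ-VAE mapping waveforms to discrete symbols."""
    def __init__(self, codebook_size=512, dim=256, downsample=256):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=downsample, stride=downsample)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=downsample, stride=downsample)

    def encode(self, wav):                        # wav: (B, 1, T) samples
        z = self.encoder(wav).transpose(1, 2)     # (B, T', dim) latent frames
        # Nearest codebook entry per frame gives the discrete symbol sequence.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        return d.argmin(-1)                       # (B, T') integer codes

    def decode(self, codes):                      # codes: (B, T')
        z = self.codebook(codes).transpose(1, 2)  # (B, dim, T')
        return self.decoder(z)                    # (B, 1, T) waveform

class TextToCodeTransformer(nn.Module):
    """Stage 2: autoregressive Transformer-NMT model "translating" text to codes."""
    def __init__(self, text_vocab, code_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(text_vocab, dim)
        self.tgt_emb = nn.Embedding(code_vocab, dim)
        self.transformer = nn.Transformer(d_model=dim, batch_first=True)
        self.out = nn.Linear(dim, code_vocab)

    def forward(self, text, codes_in):            # teacher-forced training step
        mask = self.transformer.generate_square_subsequent_mask(codes_in.size(1))
        h = self.transformer(self.src_emb(text), self.tgt_emb(codes_in), tgt_mask=mask)
        return self.out(h)                        # next-code logits, (B, U, code_vocab)
```

Because the second stage is an ordinary discrete sequence-to-sequence model trained with cross-entropy, standard NMT machinery applies unchanged: beam search over code hypotheses, subword-style merging of frequent code n-grams, and shallow fusion that adds a weighted language-model log-probability to each hypothesis score.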
Related papers
- Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z)
- Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction [14.661123738628772]
We introduce a text-to-speech (TTS) framework based on a neural transducer.
We use discretized semantic tokens acquired from wav2vec 2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework while enjoying its monotonic alignment constraints (a sketch of such token extraction follows this entry).
arXiv Detail & Related papers (2023-11-06T06:13:39Z)
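The discretized semantic tokens come from wav2vec 2.0 features. A common recipe, sketched below under assumptions (torchaudio's bundled wav2vec 2.0, a 512-cluster k-means codebook, and placeholder `training_waveforms`/`utterance` variables), is to cluster frame-level features and use the cluster IDs as tokens; the authors' exact setup may differ.

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.WAV2VEC2_BASE          # pretrained wav2vec 2.0
w2v = bundle.get_model().eval()

def semantic_features(wav):                          # wav: (1, T) at bundle.sample_rate
    with torch.no_grad():
        feats, _ = w2v.extract_features(wav)         # per-layer feature list
    return feats[-1].squeeze(0)                      # (frames, dim), last layer

# Fit a codebook on features pooled from training speech (placeholder variable),
# then map any utterance to a sequence of integer tokens for the transducer.
pool = torch.cat([semantic_features(w) for w in training_waveforms])
kmeans = KMeans(n_clusters=512).fit(pool.numpy())
tokens = kmeans.predict(semantic_features(utterance).numpy())  # 1-D int array
```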
- Structured State Space Decoder for Speech Recognition and Synthesis [9.354721572095272]
A structured state space model (S4) has recently been proposed, producing promising results on various long-sequence modeling tasks.
In this study, we apply S4 as a decoder for ASR and text-to-speech tasks and compare it with the Transformer decoder.
For the ASR task, our experimental results demonstrate that the proposed model achieves a competitive word error rate (WER) of 1.88%/4.25%.
arXiv Detail & Related papers (2022-10-31T06:54:23Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, the factorized neural transducer, which factorizes blank and vocabulary prediction (a sketch of the idea follows this entry).
This factorization is expected to transfer improvements in a standalone language model to the transducer for speech recognition.
We demonstrate that the proposed factorized neural transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
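A minimal sketch of the factorization idea, with illustrative module names and sizes (the paper's joint network differs in detail): blank prediction stays tied to the acoustics, while the vocabulary branch depends only on the label history and therefore behaves like a standalone language model.

```python
import torch
import torch.nn as nn

class FactorizedJoint(nn.Module):
    """Transducer joint with separate blank and vocabulary predictors (sketch)."""
    def __init__(self, enc_dim=512, pred_dim=512, vocab_size=4000):
        super().__init__()
        self.blank_head = nn.Linear(enc_dim + pred_dim, 1)   # blank logit
        self.vocab_lm = nn.Linear(pred_dim, vocab_size)      # LM-like label scores
        self.vocab_am = nn.Linear(enc_dim, vocab_size)       # acoustic label scores

    def forward(self, enc_t, pred_u):
        # enc_t: (B, enc_dim) one encoder frame; pred_u: (B, pred_dim) label history.
        blank = self.blank_head(torch.cat([enc_t, pred_u], dim=-1))
        labels = self.vocab_am(enc_t) + self.vocab_lm(pred_u)  # log-linear combination
        return torch.cat([blank, labels], dim=-1)              # (B, 1 + vocab_size)
```

Because `vocab_lm` sees only text-side state, it can be fine-tuned on out-of-domain text alone, and its improvement carries into recognition, which is the adaptation route described above.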
- Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition [31.2558640840697]
We propose cross-modal Transformer-based neural correction models that refine the output of an automatic speech recognition (ASR) system.
Experiments on Japanese natural-language ASR tasks demonstrate that the proposed models achieve better ASR performance than conventional neural correction models.
arXiv Detail & Related papers (2021-07-04T07:58:31Z)
- Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation [88.78138830698173]
We focus on sequence-level knowledge distillation (SeqKD) from external text-based NMT models (a toy sketch of SeqKD follows this entry).
We train a bilingual E2E-ST model to predict paraphrased transcriptions as an auxiliary task with a single decoder.
arXiv Detail & Related papers (2021-04-13T19:00:51Z)
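The core of SeqKD is easy to state: decode pseudo-targets with the text-based teacher, then train the student on them with ordinary cross-entropy. The toy sketch below uses throwaway one-layer "models" and greedy decoding as a stand-in for beam search, purely to show the data flow; in the paper the teacher is an external NMT model and the student an end-to-end speech translation model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim = 1000, 64
teacher = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))  # toy stand-in
student = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))  # toy stand-in
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def seqkd_step(src_tokens):
    with torch.no_grad():                          # teacher decodes pseudo-targets
        pseudo = teacher(src_tokens).argmax(-1)    # greedy stand-in for beam search
    logits = student(src_tokens)                   # student trains on teacher output,
    loss = F.cross_entropy(logits.view(-1, vocab), # not on the reference targets
                           pseudo.view(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

seqkd_step(torch.randint(0, vocab, (8, 12)))       # (batch, length) toy batch
```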
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech-processing tasks for which large-scale corpora are readily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis [48.151894340550385]
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes.
We investigate under what conditions neural sequence-to-sequence TTS works well in Japanese and English.
arXiv Detail & Related papers (2020-05-20T23:26:14Z)
- Variational Transformers for Diverse Response Generation [71.53159402053392]
The Variational Transformer (VT) is a variational self-attentive feed-forward sequence model.
VT combines the parallelizability and global receptive field of the Transformer with the variational nature of the CVAE.
We explore two types of VT: 1) modeling discourse-level diversity with a global latent variable, and 2) augmenting the Transformer decoder with a sequence of fine-grained latent variables (a sketch of the global-latent variant follows this entry).
arXiv Detail & Related papers (2020-03-28T07:48:02Z)
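A minimal sketch of the first variant, a CVAE-style global latent that conditions the decoder; the module names, sizes, and exact conditioning point are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalLatentVT(nn.Module):
    """CVAE-style global latent for a Transformer decoder (sketch)."""
    def __init__(self, dim=256, latent=64):
        super().__init__()
        self.prior = nn.Linear(dim, 2 * latent)          # p(z | source)
        self.posterior = nn.Linear(2 * dim, 2 * latent)  # q(z | source, target)
        self.to_dec = nn.Linear(latent, dim)             # inject z into the decoder

    @staticmethod
    def sample(params):
        mu, logvar = params.chunk(2, dim=-1)             # reparameterization trick
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp(), mu, logvar

    def forward(self, src_ctx, tgt_ctx):
        # Training: sample z from the posterior and regularize with KL(q || p).
        z, mu_q, lv_q = self.sample(self.posterior(torch.cat([src_ctx, tgt_ctx], -1)))
        mu_p, lv_p = self.prior(src_ctx).chunk(2, dim=-1)
        kl = 0.5 * (lv_p - lv_q
                    + (lv_q.exp() + (mu_q - mu_p).pow(2)) / lv_p.exp()
                    - 1).sum(-1)
        return self.to_dec(z), kl                        # decoder bias and KL term
```

At inference time, z is sampled from the prior alone, so repeated sampling yields diverse outputs for the same source context.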