JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech
without Explicit Alignment
- URL: http://arxiv.org/abs/2005.07799v3
- Date: Mon, 5 Oct 2020 02:48:58 GMT
- Title: JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech
without Explicit Alignment
- Authors: Dan Lim, Won Jang, Gyeonghwan O, Heayoung Park, Bongwan Kim, Jaesam
Yoon
- Abstract summary: We propose the Jointly trained Duration Informed Transformer (JDI-T).
JDI-T is a feed-forward Transformer with a duration predictor jointly trained without explicit alignments.
We extract the phoneme duration from the autoregressive Transformer on the fly during the joint training.
- Score: 2.7402733069181
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Jointly trained Duration Informed Transformer (JDI-T), a
feed-forward Transformer with a duration predictor jointly trained without
explicit alignments in order to generate an acoustic feature sequence from an
input text. In this work, inspired by the recent success of the duration
informed networks such as FastSpeech and DurIAN, we further simplify their
sequential, two-stage training pipeline to a single-stage training.
Specifically, we extract the phoneme duration from the autoregressive
Transformer on the fly during the joint training instead of pretraining the
autoregressive model and using it as a phoneme duration extractor. To the best
of our knowledge, this is the first implementation to jointly train the feed-forward
Transformer without relying on a pre-trained phoneme duration extractor in a
single training pipeline. We evaluate the effectiveness of the proposed model
on the publicly available Korean Single Speaker Speech (KSS) dataset compared
to the baseline text-to-speech (TTS) models trained by ESPnet-TTS.
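To make the single-stage idea concrete, the sketch below shows one possible joint training step in which the autoregressive Transformer is optimized on the TTS task while its encoder-decoder attention is converted, on the fly, into phoneme-duration targets for the feed-forward Transformer and its duration predictor. This is a minimal PyTorch-style illustration under assumed interfaces (the model objects, the argmax-based duration extraction, and the unweighted loss sum are placeholders), not the authors' implementation.

```python
# Minimal sketch (assumed interfaces, not the authors' code): one joint
# training step where phoneme durations are extracted on the fly from the
# autoregressive Transformer's attention instead of from a pre-trained
# duration extractor.
import torch
import torch.nn.functional as F


def durations_from_attention(attn):
    """attn: (batch, mel_len, phoneme_len) encoder-decoder attention.
    Count, per phoneme, how many mel frames attend to it most strongly."""
    hard_align = attn.argmax(dim=-1)                      # (batch, mel_len)
    one_hot = F.one_hot(hard_align, attn.size(-1)).float()
    return one_hot.sum(dim=1)                             # (batch, phoneme_len)


def joint_training_step(ar_model, ff_model, duration_predictor,
                        phonemes, mels, optimizer):
    """Single-stage step: the autoregressive model learns the TTS task and
    simultaneously acts as the duration extractor for the feed-forward model."""
    # Teacher-forced autoregressive pass returns predicted mels and attention.
    ar_mel, attn = ar_model(phonemes, mels)
    ar_loss = F.l1_loss(ar_mel, mels)

    # Durations are extracted on the fly; no gradient flows through alignment.
    with torch.no_grad():
        dur_target = durations_from_attention(attn)

    # Feed-forward pass uses the extracted durations for length regulation.
    ff_mel = ff_model(phonemes, dur_target)
    ff_loss = F.l1_loss(ff_mel, mels)

    # The duration predictor learns to reproduce the extracted durations so it
    # can replace the autoregressive teacher at inference time.
    dur_pred = duration_predictor(phonemes)
    dur_loss = F.mse_loss(dur_pred, torch.log(dur_target + 1.0))

    loss = ar_loss + ff_loss + dur_loss                   # unweighted sum (assumption)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time the duration predictor would stand in for the autoregressive teacher, so only the feed-forward Transformer and the predictor are needed to synthesize speech.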
Related papers
- Joint Audio/Text Training for Transformer Rescorer of Streaming Speech
Recognition [13.542483062256109]
We present our Joint Audio/Text training method for Transformer Rescorer.
Our training method can improve word error rate (WER) significantly compared to a standard Transformer Rescorer.
arXiv Detail & Related papers (2022-10-31T22:38:28Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised
Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to
Speech [7.476901945542385]
We present an end-to-end text-to-speech (E2E-TTS) model which has a simplified training pipeline and outperforms a cascade of separately learned models.
Our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPnet2-TTS.
arXiv Detail & Related papers (2022-03-31T07:25:11Z)
- Differentiable Duration Modeling for End-to-End Text-to-Speech [6.571447892202893]
Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis.
We propose a differentiable duration method for learning monotonic alignments between input and output sequences (a toy sketch of this idea follows the list below).
Our model learns to perform high-fidelity synthesis through a combination of adversarial training and matching the total ground-truth duration.
arXiv Detail & Related papers (2022-03-21T15:14:44Z)
- A study on the efficacy of model pre-training in developing neural
text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
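For the "Differentiable Duration Modeling" entry above, the general idea of making duration-based upsampling differentiable can be illustrated with a soft length regulator: predicted durations define phoneme centers on the frame axis, and each output frame is a softmax-weighted mixture of phoneme encodings, so gradients flow back into the durations. This is only a toy illustration of the concept under assumed shapes and a Gaussian-like weighting; it is not that paper's exact formulation.

```python
# Toy sketch (illustrative only): a differentiable length regulator that
# expands phoneme encodings into mel frames using soft weights derived from
# predicted durations, so duration prediction can be trained by backprop.
import torch


def soft_length_regulate(encodings, durations, n_frames, temperature=1.0):
    """encodings: (n_phonemes, dim); durations: (n_phonemes,) positive values.
    Returns (n_frames, dim) frame-level features."""
    ends = torch.cumsum(durations, dim=0)        # cumulative phoneme end times
    centers = ends - 0.5 * durations             # phoneme centers on frame axis
    frame_pos = torch.arange(n_frames, dtype=durations.dtype) + 0.5
    # Soft assignment of each frame to phonemes, peaked around the centers.
    logits = -((frame_pos[:, None] - centers[None, :]) ** 2) / temperature
    weights = torch.softmax(logits, dim=-1)      # (n_frames, n_phonemes)
    return weights @ encodings


# Toy usage: 4 phonemes with 8-dim encodings expanded to 20 frames.
enc = torch.randn(4, 8, requires_grad=True)
dur = torch.tensor([3.0, 6.0, 5.0, 6.0], requires_grad=True)
frames = soft_length_regulate(enc, dur, n_frames=20)
frames.sum().backward()                          # gradients reach the durations
print(frames.shape, dur.grad)
```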
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.