JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to
Speech
- URL: http://arxiv.org/abs/2203.16852v1
- Date: Thu, 31 Mar 2022 07:25:11 GMT
- Title: JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to
Speech
- Authors: Dan Lim, Sunghee Jung, Eesung Kim
- Abstract summary: We present an end-to-end text-to-speech (E2E-TTS) model which has a simplified training pipeline and outperforms a cascade of separately learned models.
Our proposed model jointly trains FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPnet2-TTS.
- Score: 7.476901945542385
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In neural text-to-speech (TTS), a two-stage system, or a cascade of
separately learned models, has shown synthesis quality close to that of human
speech. For example, FastSpeech2 transforms an input text into a
mel-spectrogram and HiFi-GAN then generates a raw waveform from that
mel-spectrogram; the two are called an acoustic feature generator and a neural
vocoder, respectively. However, their training pipeline is somewhat cumbersome
in that it requires fine-tuning and an accurate speech-text alignment for
optimal performance. In this work, we present an end-to-end text-to-speech
(E2E-TTS) model which has a simplified training pipeline and outperforms a
cascade of separately learned models. Specifically, our proposed model jointly
trains FastSpeech2 and HiFi-GAN with an alignment module. Since there is no
acoustic feature mismatch between training and inference, it does not require
fine-tuning. Furthermore, we remove the dependency on an external speech-text
alignment tool by adopting an alignment learning objective in our joint
training framework. Experiments on the LJSpeech corpus show that the proposed
model outperforms publicly available, state-of-the-art implementations of
ESPnet2-TTS on subjective evaluation (MOS) and some objective evaluations.
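
To make the joint objective concrete, the sketch below combines the three ingredients the abstract describes: HiFi-GAN's adversarial and mel-reconstruction losses, FastSpeech2's variance-predictor losses, and an alignment learning loss that replaces an external speech-text aligner. This is a minimal illustration, not the authors' implementation; the PyTorch framing, least-squares GAN losses, function names, and loss weights shown are assumptions for the sketch only.

# Minimal sketch of a JETS-style joint training objective. The function name,
# tensor shapes, and loss weights are illustrative assumptions, not the
# paper's exact configuration.
import torch.nn.functional as F

def jets_joint_losses(
    d_outs_fake,                 # list of discriminator scores on generated audio
    d_outs_real,                 # list of discriminator scores on real audio
    mel_fake, mel_real,          # (B, n_mels, T) mel features of both waveforms
    dur_pred, dur_target,        # (B, T_text) durations from the alignment module
    pitch_pred, pitch_target,    # (B, T_text) variance-predictor targets
    energy_pred, energy_target,
    align_loss,                  # scalar soft speech-text alignment loss
):
    """Return (generator_loss, discriminator_loss) for one training step."""
    # HiFi-GAN generator terms: least-squares adversarial loss plus L1 mel loss.
    adv_loss = sum(((1.0 - d) ** 2).mean() for d in d_outs_fake)
    mel_loss = F.l1_loss(mel_fake, mel_real)

    # FastSpeech2 variance-predictor terms; duration targets come from the
    # jointly learned alignment module rather than an external aligner.
    var_loss = (
        F.mse_loss(dur_pred, dur_target)
        + F.mse_loss(pitch_pred, pitch_target)
        + F.mse_loss(energy_pred, energy_target)
    )

    gen_loss = adv_loss + 45.0 * mel_loss + var_loss + 2.0 * align_loss

    # Discriminator terms; in practice these scores are recomputed on detached
    # generated audio and optimized in a separate backward/step.
    disc_loss = sum(
        ((1.0 - dr) ** 2).mean() + (df ** 2).mean()
        for dr, df in zip(d_outs_real, d_outs_fake)
    )
    return gen_loss, disc_loss
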
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech-PseudoLabel pairs and SynthesizedAudio-text pairs, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- Differentiable Duration Modeling for End-to-End Text-to-Speech [6.571447892202893]
Parallel text-to-speech (TTS) models have recently enabled fast and highly-natural speech synthesis.
We propose a differentiable duration method for learning monotonic sequences between input and output.
Our model learns to perform high-fidelity synthesis through a combination of adversarial training and matching the total ground-truth duration.
arXiv Detail & Related papers (2022-03-21T15:14:44Z)
- Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)