ESPnet2-TTS: Extending the Edge of TTS Research
- URL: http://arxiv.org/abs/2110.07840v1
- Date: Fri, 15 Oct 2021 03:27:45 GMT
- Title: ESPnet2-TTS: Extending the Edge of TTS Research
- Authors: Tomoki Hayashi and Ryuichi Yamamoto and Takenori Yoshimura and Peter
Wu and Jiatong Shi and Takaaki Saeki and Yooncheol Ju and Yusuke Yasuda and
Shinnosuke Takamichi and Shinji Watanabe
- Abstract summary: ESPnet2-TTS is an end-to-end text-to-speech (E2E-TTS) toolkit.
New features include: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling.
- Score: 62.92178873052468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS)
toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many
new features, including: on-the-fly flexible pre-processing, joint training
with neural vocoders, and state-of-the-art TTS models with extensions like
full-band E2E text-to-waveform modeling, which simplify the training pipeline
and further enhance TTS performance. The unified design of our recipes enables
users to quickly reproduce state-of-the-art E2E-TTS results. We also provide
many pre-trained models in a unified Python interface for inference, offering a
quick means for users to generate baseline samples and build demos.
Experimental evaluations with English and Japanese corpora demonstrate that our
provided models synthesize utterances comparable to ground-truth ones,
achieving state-of-the-art TTS performance. The toolkit is available online at
https://github.com/espnet/espnet.
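As a quick illustration of the unified Python inference interface described above, the following minimal sketch loads a pre-trained model from the ESPnet model zoo and synthesizes a waveform. The model tag "kan-bayashi/ljspeech_vits" is one published zoo entry used here as an example; exact tags and method names may vary across ESPnet versions, so treat this as a sketch rather than a definitive usage guide.
```python
# Minimal sketch: ESPnet2-TTS inference with a pre-trained model.
# Assumes `espnet`, `espnet_model_zoo`, and `soundfile` are installed.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

# Example model tag from the ESPnet model zoo (a full-band VITS model).
text2speech = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")
output = text2speech("Hello, this is a test of ESPnet2-TTS.")

# Full-band text-to-waveform models return the waveform directly,
# so no separate vocoder call is needed here.
sf.write("sample.wav", output["wav"].numpy(), text2speech.fs)
```
Mel-spectrogram models loaded through the same interface would typically be paired with a trained neural vocoder instead of producing the waveform directly.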
Related papers
- Text-To-Speech Synthesis In The Wild [76.71096751337888]
Text-to-speech (TTS) systems are traditionally trained using modest databases of studio-quality, prompted or read speech collected in benign acoustic environments such as anechoic rooms.
We introduce the TTS In the Wild (TITW) dataset, the result of a fully automated pipeline, applied to the VoxCeleb1 dataset commonly used for speaker recognition.
We show that a number of recent TTS models can be trained successfully using TITW-Easy, but that it remains extremely challenging to produce similar results using TITW-Hard.
arXiv Detail & Related papers (2024-09-13T10:58:55Z)
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) synthesis.
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
arXiv Detail & Related papers (2024-03-13T01:27:57Z)
- Enhancing Speech-to-Speech Translation with Multiple TTS Targets [62.18395387305803]
We analyze the effect of changing synthesized target speech for direct S2ST models.
We propose a multi-task framework that jointly optimizes the S2ST system with multiple targets from different TTS systems.
arXiv Detail & Related papers (2023-04-10T14:33:33Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation [4.995698126365142]
We propose Nix-TTS, a lightweight neural TTS (Text-to-Speech) model.
We apply knowledge distillation to a powerful yet large-sized generative TTS teacher model.
Nix-TTS is end-to-end (vocoder-free) with only 5.23M parameters, up to an 82% reduction relative to the teacher model.
arXiv Detail & Related papers (2022-03-29T15:04:26Z)
- On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs between sparsity and its effects on synthetic speech.
Our findings suggest that end-to-end TTS models are not only highly prunable but also, perhaps surprisingly, that pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility (see the pruning sketch after this list).
arXiv Detail & Related papers (2021-10-04T02:03:28Z)
- Unified Mandarin TTS Front-end Based on Distilled BERT Model [5.103126953298633]
A model based on a pre-trained language model (PLM) is proposed to tackle the two most important tasks in the TTS front-end.
We use a pre-trained Chinese BERT as the text encoder and employ multi-task learning to adapt it to the two TTS front-end tasks.
This lets the whole TTS front-end module run in a lightweight, unified manner that is friendlier to deployment on mobile devices.
arXiv Detail & Related papers (2020-12-31T02:34:57Z)
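The sparsity study listed above ("On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis") reports that E2E-TTS models remain natural and intelligible under heavy pruning. The sketch below shows generic unstructured magnitude pruning with PyTorch's built-in utilities; the toy model is a hypothetical stand-in for a real TTS network and is not taken from any of the papers listed.
```python
# Sketch: L1 (magnitude) unstructured pruning of a toy network's Linear layers.
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical stand-in for a TTS network (e.g., 80-dim mel features in/out).
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 90% of weights with the smallest magnitude,
        # then bake the pruning mask into the weight tensor.
        prune.l1_unstructured(module, name="weight", amount=0.9)
        prune.remove(module, "weight")

total = sum(p.numel() for p in model.parameters())
zeros = sum(int((p == 0).sum()) for p in model.parameters())
print(f"overall sparsity: {zeros / total:.1%}")  # most weights are now zero
```
Biases are left unpruned here, so the overall sparsity lands slightly below the 90% applied to the weight matrices.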