Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via
Non End-to-End Distillation
- URL: http://arxiv.org/abs/2203.15643v1
- Date: Tue, 29 Mar 2022 15:04:26 GMT
- Title: Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via
Non End-to-End Distillation
- Authors: Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji
- Abstract summary: We propose Nix-TTS, a lightweight neural TTS (Text-to-Speech) model.
We apply knowledge distillation to a powerful yet large-sized generative TTS teacher model.
Nix-TTS is end-to-end (vocoder-free) with only 5.23M parameters, a reduction of up to 82% from the teacher model.
- Score: 4.995698126365142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Nix-TTS, a lightweight neural TTS (Text-to-Speech) model achieved
by applying knowledge distillation to a powerful yet large-sized generative TTS
teacher model. Distilling a TTS model might sound unintuitive due to the
generative and disjointed nature of TTS architectures, but pre-trained TTS
models can be simplified into encoder and decoder structures, where the former
encodes text into some latent representation and the latter decodes the latent
into speech data. We devise a framework to distill each component in a non
end-to-end fashion. Nix-TTS is end-to-end (vocoder-free) with only 5.23M
parameters, a reduction of up to 82% from the teacher model; it achieves over
3.26× and 8.36× inference speedups on an Intel i7 CPU and a Raspberry Pi,
respectively, and still retains fair voice naturalness and intelligibility
compared to the teacher model. We publicly release Nix-TTS pretrained models
and audio samples in English (https://github.com/rendchevi/nix-tts).
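The abstract's key idea is that the teacher splits into an encoder (text to latent) and a decoder (latent to speech), and each student component is distilled independently rather than end-to-end. The sketch below illustrates that recipe in NumPy with linear stand-ins for the encoder and decoder; all dimensions, weights, and the plain MSE objective are hypothetical illustrations, not the actual Nix-TTS architecture or losses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): text features -> latent -> waveform frames.
T_TXT, D_LAT, D_WAV = 16, 8, 32

# Frozen linear "teacher" encoder/decoder, stand-ins for the large generative TTS model.
W_enc_t = rng.normal(size=(T_TXT, D_LAT))
W_dec_t = rng.normal(size=(D_LAT, D_WAV))

# Small randomly initialized "student" weights, trained purely by distillation.
W_enc_s = rng.normal(size=(T_TXT, D_LAT)) * 0.01
W_dec_s = rng.normal(size=(D_LAT, D_WAV)) * 0.01

def mse(a, b):
    return float(np.mean((a - b) ** 2))

x = rng.normal(size=(4, T_TXT))   # a small batch of dummy text features
z_t = x @ W_enc_t                 # teacher latent: fixed target for the student encoder
y_t = z_t @ W_dec_t               # teacher output: fixed target for the student decoder

loss0_enc = mse(x @ W_enc_s, z_t)
loss0_dec = mse(z_t @ W_dec_s, y_t)

lr, steps = 0.05, 500
for _ in range(steps):
    # Encoder distillation: match the teacher's latent given the same text input.
    z_s = x @ W_enc_s
    W_enc_s -= lr * 2 * x.T @ (z_s - z_t) / (x.shape[0] * D_LAT)

    # Decoder distillation: feed the *teacher's* latent and match the teacher's
    # output, so the two components are distilled independently (non end-to-end).
    y_s = z_t @ W_dec_s
    W_dec_s -= lr * 2 * z_t.T @ (y_s - y_t) / (z_t.shape[0] * D_WAV)

print("encoder distillation loss:", loss0_enc, "->", mse(x @ W_enc_s, z_t))
print("decoder distillation loss:", loss0_dec, "->", mse(z_t @ W_dec_s, y_t))
```

Because the decoder student never sees the encoder student's (possibly inaccurate) latents during training, each component can be compressed and trained in isolation, which is the "non end-to-end" aspect the title refers to.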
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer [9.032701216955497]
We present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders.
Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms combined with predicting the total length of the speech representations.
We scale the training dataset and the model size to 82K hours and 790M parameters, respectively.
arXiv Detail & Related papers (2024-06-17T11:25:57Z)
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS).
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
arXiv Detail & Related papers (2024-03-13T01:27:57Z)
- BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data [15.447206120523356]
BASE TTS is the largest TTS model to date, trained on 100K hours of public-domain speech data.
We show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences.
arXiv Detail & Related papers (2024-02-12T22:21:30Z)
- Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series that offers compact yet high-performing conversational TTS models.
It can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech synthesis (TTS).
Specifically, we train a neural language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio model.
Vall-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- ESPnet2-TTS: Extending the Edge of TTS Research [62.92178873052468]
ESPnet2-TTS is an end-to-end text-to-speech (E2E-TTS) toolkit.
New features include: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling.
arXiv Detail & Related papers (2021-10-15T03:27:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.