LightSpeech: Lightweight and Fast Text to Speech with Neural
Architecture Search
- URL: http://arxiv.org/abs/2102.04040v1
- Date: Mon, 8 Feb 2021 07:45:06 GMT
- Title: LightSpeech: Lightweight and Fast Text to Speech with Neural
Architecture Search
- Authors: Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Jinzhu Li, Sheng Zhao, Enhong
Chen, Tie-Yan Liu
- Abstract summary: We propose LightSpeech to automatically design more lightweight and efficient TTS models based on FastSpeech.
Experiments show that the model discovered by our method achieves a 15x model compression ratio and a 6.5x inference speedup on CPU with on-par voice quality.
- Score: 127.56834100382878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text to speech (TTS) has been broadly used to synthesize natural and
intelligible speech in different scenarios. Deploying TTS in various end
devices such as mobile phones or embedded devices requires extremely small
memory usage and inference latency. While non-autoregressive TTS models such as
FastSpeech have achieved significantly faster inference speed than
autoregressive models, their model size and inference latency are still large
for the deployment in resource constrained devices. In this paper, we propose
LightSpeech, which leverages neural architecture search (NAS) to automatically
design more lightweight and efficient models based on FastSpeech. We first
profile the components of the current FastSpeech model and carefully design a novel
search space containing various lightweight and potentially effective
architectures. Then NAS is utilized to automatically discover well performing
architectures within the search space. Experiments show that the model
discovered by our method achieves a 15x model compression ratio and a 6.5x
inference speedup on CPU with on-par voice quality. Audio demos are provided at
https://speechresearch.github.io/lightspeech.
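The core idea — enumerating a search space of lightweight candidate operations and picking small architectures from it — can be illustrated with a toy sketch. This is not the actual LightSpeech search space or NAS algorithm: the op names, hidden size, kernel sizes, and exhaustive-minimum "search" below are hypothetical stand-ins (a real NAS run would train and evaluate candidates, trading model size against voice quality).

```python
# Toy sketch of a lightweight-architecture search space (illustrative only).
from itertools import product

HIDDEN = 256  # assumed hidden channel size for illustration

def conv_params(channels, kernel):
    """Parameters of a standard 1-D convolution (weight + bias)."""
    return channels * channels * kernel + channels

def sep_conv_params(channels, kernel):
    """Depthwise-separable 1-D convolution: depthwise + pointwise (+ biases)."""
    return channels * kernel + channels * channels + 2 * channels

def model_size(layer_choices):
    """Total parameter count for a stack of candidate layers."""
    total = 0
    for op, kernel in layer_choices:
        if op == "conv":
            total += conv_params(HIDDEN, kernel)
        else:  # "sepconv"
            total += sep_conv_params(HIDDEN, kernel)
    return total

# A toy 4-layer search space: each layer picks an op type and a kernel size.
ops = ["conv", "sepconv"]
kernels = [3, 5, 9]
candidates = list(product(product(ops, kernels), repeat=4))  # 6^4 = 1296 models

# Exhaustively pick the smallest architecture; real NAS would instead search
# for the best quality/size trade-off rather than size alone.
best = min(candidates, key=model_size)
print(best, model_size(best))
```

At this hidden size, depthwise-separable convolutions dominate the small end of the space, which mirrors why such ops are natural members of a lightweight TTS search space.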
Related papers
- HierSpeech++: Bridging the Gap between Semantic and Acoustic
Representation of Speech by Hierarchical Variational Inference for Zero-shot
Speech Synthesis [39.892633589217326]
Large language model (LLM)-based speech synthesis has been widely adopted for zero-shot speech synthesis.
This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC).
arXiv Detail & Related papers (2023-11-21T09:07:11Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- EfficientSpeech: An On-Device Text to Speech Model [15.118059441365343]
State of the art (SOTA) neural text to speech (TTS) models can generate natural-sounding synthetic voices.
In this work, an efficient neural TTS called EfficientSpeech that synthesizes speech on an ARM CPU in real-time is proposed.
arXiv Detail & Related papers (2023-05-23T10:28:41Z)
- Application-Agnostic Language Modeling for On-Device ASR [6.03523493247947]
On-device automatic speech recognition systems face several challenges compared to server-based systems.
They have to meet stricter constraints in terms of speed, disk size and memory.
One of our novel approaches reduces the disk size by half, while maintaining speed and accuracy of the original model.
arXiv Detail & Related papers (2023-05-16T19:31:18Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- On-device neural speech synthesis [3.716815259884143]
Tacotron and WaveRNN have made it possible to construct a fully neural network based TTS system.
We present key modeling improvements and optimization strategies that enable deploying these models on GPU servers and on mobile devices.
The proposed system can generate high-quality 24 kHz speech 5x faster than real time on a server and 3x faster than real time on mobile devices.
arXiv Detail & Related papers (2021-09-17T18:31:31Z)
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech [189.05831125931053]
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality.
FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and 3) the target mel-spectrograms distilled from the teacher model suffer from information loss.
We propose FastSpeech 2, which addresses these issues by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.
arXiv Detail & Related papers (2020-06-08T13:05:40Z)
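The variance-conditioning idea in the FastSpeech 2 entry above — embedding quantized pitch/energy values and adding them to the encoder hidden states — can be sketched in miniature. Everything here is a hypothetical illustration, not the paper's implementation: the bucket count, hidden size, and embedding tables are toy values.

```python
# Toy sketch of FastSpeech 2-style variance conditioning (illustrative only).
import random

HIDDEN = 8       # toy hidden size
N_BUCKETS = 4    # toy number of pitch/energy quantization buckets

random.seed(0)
# Randomly initialized embedding tables, one vector per bucket.
pitch_table = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(N_BUCKETS)]
energy_table = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(N_BUCKETS)]

def bucketize(value, lo=0.0, hi=1.0, n=N_BUCKETS):
    """Map a continuous variance value into one of n buckets."""
    idx = int((value - lo) / (hi - lo) * n)
    return min(max(idx, 0), n - 1)

def add_variance(hidden, pitch, energy):
    """Add pitch and energy embeddings to one frame's hidden vector."""
    p = pitch_table[bucketize(pitch)]
    e = energy_table[bucketize(energy)]
    return [h + pi + ei for h, pi, ei in zip(hidden, p, e)]

frame = [0.0] * HIDDEN
out = add_variance(frame, pitch=0.3, energy=0.9)
print(len(out))
```

In training, the quantized pitch/energy come from ground-truth signals; at inference, learned predictors supply them, which is what removes the dependence on a teacher model's simplified outputs.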
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.