LightSpeech: Lightweight and Fast Text to Speech with Neural
Architecture Search
- URL: http://arxiv.org/abs/2102.04040v1
- Date: Mon, 8 Feb 2021 07:45:06 GMT
- Title: LightSpeech: Lightweight and Fast Text to Speech with Neural
Architecture Search
- Authors: Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Jinzhu Li, Sheng Zhao, Enhong
Chen, Tie-Yan Liu
- Abstract summary: We propose LightSpeech to automatically design more lightweight and efficient TTS models based on FastSpeech.
Experiments show that the model discovered by our method achieves 15x model compression ratio and 6.5x inference speedup on CPU with on par voice quality.
- Score: 127.56834100382878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text to speech (TTS) has been broadly used to synthesize natural and
intelligible speech in different scenarios. Deploying TTS in various end
devices such as mobile phones or embedded devices requires extremely small
memory usage and inference latency. While non-autoregressive TTS models such as
FastSpeech have achieved significantly faster inference speed than
autoregressive models, their model size and inference latency are still large
for the deployment in resource constrained devices. In this paper, we propose
LightSpeech, which leverages neural architecture search~(NAS) to automatically
design more lightweight and efficient models based on FastSpeech. We first
profile the components of current FastSpeech model and carefully design a novel
search space containing various lightweight and potentially effective
architectures. Then NAS is utilized to automatically discover well performing
architectures within the search space. Experiments show that the model
discovered by our method achieves 15x model compression ratio and 6.5x
inference speedup on CPU with on par voice quality. Audio demos are provided at
https://speechresearch.github.io/lightspeech.
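The search procedure the abstract describes (define a space of lightweight per-layer operators, then search it automatically) can be sketched with a toy random search. Everything below is illustrative: the operator names, cost numbers, and layer count are placeholders, not LightSpeech's actual search space, and plain random search stands in for the paper's NAS method.

```python
import random

# Hypothetical per-layer operator choices with illustrative relative
# parameter costs. Lightweight separable convolutions of varying kernel
# sizes are loosely inspired by the paper's motivation; the numbers are
# made up for this sketch.
CANDIDATE_OPS = {
    "transformer_block": 3.0,
    "sep_conv_k5": 1.0,
    "sep_conv_k9": 1.2,
    "sep_conv_k13": 1.4,
}

NUM_LAYERS = 4  # toy encoder depth


def sample_architecture(rng):
    """Sample one operator choice for each layer."""
    return tuple(rng.choice(list(CANDIDATE_OPS)) for _ in range(NUM_LAYERS))


def param_cost(arch):
    """Proxy objective: total relative parameter count of the stack.
    A real NAS would also score candidates by synthesis quality under
    a size/latency budget; cost alone stands in for that here."""
    return sum(CANDIDATE_OPS[op] for op in arch)


def random_search(num_samples=200, seed=0):
    """Return the cheapest architecture among random samples."""
    rng = random.Random(seed)
    return min(
        (sample_architecture(rng) for _ in range(num_samples)),
        key=param_cost,
    )


best = random_search()
print(best, param_cost(best))
```

In practice the discovered architecture would then be trained from scratch and evaluated for voice quality; this sketch only illustrates the search-space-plus-search loop, not the accuracy-aware selection the paper relies on.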
Related papers
- Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis [7.2129341612013285]
We introduce Lina-Speech, a model that replaces traditional self-attention mechanisms with emerging recurrent architectures like Gated Linear Attention (GLA).
This approach is fast, easy to deploy, and achieves performance comparable to fine-tuned baselines when the dataset size ranges from 3 to 15 minutes.
arXiv Detail & Related papers (2024-10-30T04:50:40Z)
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- EfficientSpeech: An On-Device Text to Speech Model [15.118059441365343]
State of the art (SOTA) neural text to speech (TTS) models can generate natural-sounding synthetic voices.
In this work, an efficient neural TTS called EfficientSpeech that synthesizes speech on an ARM CPU in real time is proposed.
arXiv Detail & Related papers (2023-05-23T10:28:41Z)
- On-device neural speech synthesis [3.716815259884143]
Tacotron and WaveRNN have made it possible to construct a fully neural network based TTS system.
We present key modeling improvements and optimization strategies that enable deploying these models on GPU servers and on mobile devices.
The proposed system can generate high-quality 24 kHz speech at 5x faster than real time on server and 3x faster than real time on mobile devices.
arXiv Detail & Related papers (2021-09-17T18:31:31Z)
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech [189.05831125931053]
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality.
FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and 3) the target mel-spectrograms distilled from the teacher model suffer from information loss.
We propose FastSpeech 2, which 1) directly trains the model with ground-truth targets instead of the simplified output from the teacher, and 2) introduces more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.
arXiv Detail & Related papers (2020-06-08T13:05:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.