LightSpeech: Lightweight and Fast Text to Speech with Neural
Architecture Search
- URL: http://arxiv.org/abs/2102.04040v1
- Date: Mon, 8 Feb 2021 07:45:06 GMT
- Title: LightSpeech: Lightweight and Fast Text to Speech with Neural
Architecture Search
- Authors: Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Jinzhu Li, Sheng Zhao, Enhong
Chen, Tie-Yan Liu
- Abstract summary: We propose LightSpeech to automatically design more lightweight and efficient TTS models based on FastSpeech.
Experiments show that the model discovered by our method achieves a 15x model compression ratio and a 6.5x inference speedup on CPU with on-par voice quality.
- Score: 127.56834100382878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text to speech (TTS) has been broadly used to synthesize natural and
intelligible speech in different scenarios. Deploying TTS in various end
devices such as mobile phones or embedded devices requires extremely small
memory usage and inference latency. While non-autoregressive TTS models such as
FastSpeech have achieved significantly faster inference speed than
autoregressive models, their model size and inference latency are still large
for the deployment in resource constrained devices. In this paper, we propose
LightSpeech, which leverages neural architecture search (NAS) to automatically
design more lightweight and efficient models based on FastSpeech. We first
profile the components of the current FastSpeech model and carefully design a novel
search space containing various lightweight and potentially effective
architectures. Then NAS is utilized to automatically discover well performing
architectures within the search space. Experiments show that the model
discovered by our method achieves a 15x model compression ratio and a 6.5x
inference speedup on CPU with on-par voice quality. Audio demos are provided at
https://speechresearch.github.io/lightspeech.
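The core idea — enumerating a search space of lightweight candidate operations and picking small architectures from it — can be illustrated with a toy sketch. This is not the actual LightSpeech search space or NAS algorithm: the op names, hidden size, kernel sizes, and exhaustive-minimum "search" below are hypothetical stand-ins (a real NAS run would train and evaluate candidates, trading model size against voice quality).

```python
# Toy sketch of a lightweight-architecture search space (illustrative only).
from itertools import product

HIDDEN = 256  # assumed hidden channel size for illustration

def conv_params(channels, kernel):
    """Parameters of a standard 1-D convolution (weight + bias)."""
    return channels * channels * kernel + channels

def sep_conv_params(channels, kernel):
    """Depthwise-separable 1-D convolution: depthwise + pointwise (+ biases)."""
    return channels * kernel + channels * channels + 2 * channels

def model_size(layer_choices):
    """Total parameter count for a stack of candidate layers."""
    total = 0
    for op, kernel in layer_choices:
        if op == "conv":
            total += conv_params(HIDDEN, kernel)
        else:  # "sepconv"
            total += sep_conv_params(HIDDEN, kernel)
    return total

# A toy 4-layer search space: each layer picks an op type and a kernel size.
ops = ["conv", "sepconv"]
kernels = [3, 5, 9]
candidates = list(product(product(ops, kernels), repeat=4))  # 6^4 = 1296 models

# Exhaustively pick the smallest architecture; real NAS would instead search
# for the best quality/size trade-off rather than size alone.
best = min(candidates, key=model_size)
print(best, model_size(best))
```

At this hidden size, depthwise-separable convolutions dominate the small end of the space, which mirrors why such ops are natural members of a lightweight TTS search space.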
Related papers
- HierSpeech++: Bridging the Gap between Semantic and Acoustic
Representation of Speech by Hierarchical Variational Inference for Zero-shot
Speech Synthesis [39.892633589217326]
Large language model (LLM)-based speech synthesis has been widely adopted for zero-shot speech synthesis.
This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC).
arXiv Detail & Related papers (2023-11-21T09:07:11Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- EfficientSpeech: An On-Device Text to Speech Model [15.118059441365343]
State of the art (SOTA) neural text to speech (TTS) models can generate natural-sounding synthetic voices.
In this work, an efficient neural TTS called EfficientSpeech that synthesizes speech on an ARM CPU in real-time is proposed.
arXiv Detail & Related papers (2023-05-23T10:28:41Z)
- Application-Agnostic Language Modeling for On-Device ASR [6.03523493247947]
On-device automatic speech recognition systems face several challenges compared to server-based systems.
They have to meet stricter constraints in terms of speed, disk size and memory.
One of our novel approaches reduces the disk size by half, while maintaining speed and accuracy of the original model.
arXiv Detail & Related papers (2023-05-16T19:31:18Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- On-device neural speech synthesis [3.716815259884143]
Tacotron and WaveRNN have made it possible to construct a fully neural network based TTS system.
We present key modeling improvements and optimization strategies that enable deploying these models on GPU servers and on mobile devices.
The proposed system can generate high-quality 24 kHz speech 5x faster than real time on a server and 3x faster than real time on mobile devices.
arXiv Detail & Related papers (2021-09-17T18:31:31Z)
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech [189.05831125931053]
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality.
FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and 3) the target mel-spectrograms distilled from the teacher model suffer from information loss.
We propose FastSpeech 2, which addresses these issues by 1) directly training the model with the ground-truth target instead of the simplified output from the teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs.
arXiv Detail & Related papers (2020-06-08T13:05:40Z)
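The variance-conditioning idea in the FastSpeech 2 entry above — embedding quantized pitch/energy values and adding them to the encoder hidden states — can be sketched in miniature. Everything here is a hypothetical illustration, not the paper's implementation: the bucket count, hidden size, and embedding tables are toy values.

```python
# Toy sketch of FastSpeech 2-style variance conditioning (illustrative only).
import random

HIDDEN = 8       # toy hidden size
N_BUCKETS = 4    # toy number of pitch/energy quantization buckets

random.seed(0)
# Randomly initialized embedding tables, one vector per bucket.
pitch_table = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(N_BUCKETS)]
energy_table = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(N_BUCKETS)]

def bucketize(value, lo=0.0, hi=1.0, n=N_BUCKETS):
    """Map a continuous variance value into one of n buckets."""
    idx = int((value - lo) / (hi - lo) * n)
    return min(max(idx, 0), n - 1)

def add_variance(hidden, pitch, energy):
    """Add pitch and energy embeddings to one frame's hidden vector."""
    p = pitch_table[bucketize(pitch)]
    e = energy_table[bucketize(energy)]
    return [h + pi + ei for h, pi, ei in zip(hidden, p, e)]

frame = [0.0] * HIDDEN
out = add_variance(frame, pitch=0.3, energy=0.9)
print(len(out))
```

In training, the quantized pitch/energy come from ground-truth signals; at inference, learned predictors supply them, which is what removes the dependence on a teacher model's simplified outputs.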
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.