Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding
- URL: http://arxiv.org/abs/2505.15380v2
- Date: Tue, 03 Jun 2025 03:01:24 GMT
- Title: Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding
- Authors: Zijian Lin, Yang Zhang, Yougen Yuan, Yuming Yan, Jinjiang Liu, Zhiyong Wu, Pengfei Hu, Qun Yu
- Abstract summary: Speech Speculative Decoding (SSD) is a novel framework for autoregressive speech synthesis acceleration. SSD achieves a significant speedup of 1.4x compared with conventional autoregressive decoding.
- Score: 21.682444278458433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern autoregressive speech synthesis models leveraging language models have demonstrated remarkable performance. However, the sequential nature of next token prediction in these models leads to significant latency, hindering their deployment in scenarios where inference speed is critical. In this work, we propose Speech Speculative Decoding (SSD), a novel framework for autoregressive speech synthesis acceleration. Specifically, our method employs a lightweight draft model to generate candidate token sequences, which are subsequently verified in parallel by the target model using the proposed SSD framework. Experimental results demonstrate that SSD achieves a significant speedup of 1.4x compared with conventional autoregressive decoding, while maintaining high fidelity and naturalness. Subjective evaluations further validate the effectiveness of SSD in preserving the perceptual quality of the target model while accelerating inference.
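The draft-then-verify loop described in the abstract can be sketched as follows. This is a minimal greedy-verification sketch, not the authors' SSD implementation: `draft_next` and `target_next` are hypothetical single-step interfaces, and a real system would score all draft positions in one batched forward pass rather than a Python loop.

```python
def speculative_decode(target_next, draft_next, prompt, max_len, k=4):
    """Greedy speculative decoding sketch.

    target_next(seq) -> next token under the (slow) target model
    draft_next(seq)  -> next token under the (cheap) draft model
    """
    seq = list(prompt)
    while len(seq) < max_len:
        # 1. Draft model proposes k candidate tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. Target verifies the k positions (conceptually in parallel).
        accepted = []
        for tok in draft:
            t = target_next(seq + accepted)
            if t == tok:
                accepted.append(tok)   # draft matches target: keep it
            else:
                accepted.append(t)     # mismatch: take the target's token
                break                  # and discard the rest of the draft
        else:
            # All k draft tokens accepted; target adds one bonus token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:max_len]
```

With greedy verification, the output is identical to plain greedy decoding with the target model; the draft only changes how many tokens are committed per target pass, which is where the speedup comes from.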
Related papers
- READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed video latent space via a VAE, significantly reducing the token count for speech generation. We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
- Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy [20.962236229450454]
We introduce a novel dynamic chunk-wise autoregressive synthesis framework, termed DCAR, to enhance both efficiency and intelligibility in AR speech generation. DCAR substantially outperforms traditional next-token prediction models, achieving up to 72.27% intelligibility improvement and 2.61x inference speedup simultaneously on the test set.
arXiv Detail & Related papers (2025-06-27T08:45:21Z)
- Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
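A model-free drafter in the spirit of STAND can be sketched as an n-gram lookup over tokens the model has already produced. The table layout and the `k`/`n` defaults below are illustrative assumptions, not STAND's actual (stochastic, adaptive) data structure:

```python
def update_table(seq, table, n=2):
    # Record, for each (n-1)-gram context, the most recent continuation seen.
    for i in range(len(seq) - n + 1):
        table[tuple(seq[i:i + n - 1])] = seq[i + n - 1]

def ngram_draft(seq, table, k=4, n=2):
    # Propose up to k draft tokens by chaining context -> next-token lookups;
    # stop as soon as a context has never been observed.
    draft = []
    ctx = tuple(seq[-(n - 1):])
    for _ in range(k):
        if ctx not in table:
            break
        nxt = table[ctx]
        draft.append(nxt)
        ctx = tuple((list(ctx) + [nxt])[-(n - 1):])
    return draft
```

Because drafting is a table lookup, no draft network is run or trained, which is what lets such a scheme attach to any existing model.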
arXiv Detail & Related papers (2025-06-05T07:31:18Z)
- Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model. The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
- Consultant Decoding: Yet Another Synergistic Mechanism [49.996656694586164]
Consultant Decoding (CD) verifies candidate drafts using token-level likelihoods computed solely by the large language model. CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality.
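An acceptance rule of this kind can be sketched as a likelihood threshold applied by the target model alone. The threshold value and the `target_logprob` interface below are illustrative assumptions, not the paper's exact criterion:

```python
import math

def consultant_verify(seq, draft_tokens, target_logprob, thresh=math.log(0.1)):
    # Accept the longest prefix of the draft whose tokens the target model
    # itself judges sufficiently likely; stop at the first unlikely token.
    accepted = []
    for tok in draft_tokens:
        if target_logprob(seq + accepted, tok) >= thresh:
            accepted.append(tok)
        else:
            break
    return accepted
```

Unlike exact-match verification, a likelihood threshold can accept draft tokens that differ from the target's own greedy choice, trading strict output equivalence for a higher acceptance rate.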
arXiv Detail & Related papers (2025-06-03T03:13:27Z)
- Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding [11.128340782271305]
We introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively.
arXiv Detail & Related papers (2024-10-29T11:12:01Z)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [55.0194604505437]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference. This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
- PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding [44.77985942208969]
PRoDeliberation is a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models.
We show that PRoDeliberation achieves the latency reduction of parallel decoding (2-10x improvement over autoregressive models) while retaining the ability to correct Automatic Speech Recognition (ASR) mistranscriptions.
arXiv Detail & Related papers (2024-06-12T02:46:17Z)
- Non-autoregressive Sequence-to-Sequence Vision-Language Models [59.445765313094434]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder. The model achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z)
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding [43.659680579686544]
We propose a Fast and Robust Early-Exiting framework, which incorporates a shallow-deep module and a synchronized parallel decoding.
Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens.
As parallel decoding allows us to observe predictions from both shallow and deep models, we present a novel adaptive threshold estimator.
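A shallow-deep early exit with a confidence check can be sketched as below. The per-layer logits interface and the fixed threshold here are simplifying assumptions standing in for the paper's synchronized parallel decoding and adaptive threshold estimator:

```python
import math

def early_exit_predict(layer_logits, threshold=0.9):
    # Walk the layers from shallow to deep; emit the first prediction whose
    # softmax confidence clears the threshold, falling back to the deepest
    # layer. Returns (token, exit_depth).
    for depth, logits in enumerate(layer_logits, start=1):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        probs = [e / s for e in exps]
        conf = max(probs)
        if conf >= threshold or depth == len(layer_logits):
            return probs.index(conf), depth
```

Easy tokens exit at shallow depth and save compute; hard tokens fall through to the full model, which is why threshold selection drives the speed-quality trade-off.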
arXiv Detail & Related papers (2023-10-09T05:53:05Z)
- Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), each have notable drawbacks: GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
- Semi-Autoregressive Image Captioning [153.9658053662605]
Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner.
Non-autoregressive image captioning with continuous iterative refinement can achieve comparable performance to the autoregressive counterparts with a considerable acceleration.
We propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC), to make a better trade-off between performance and speed.
arXiv Detail & Related papers (2021-10-11T15:11:54Z)
- STYLER: Style Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech [2.622482339911829]
STYLER is a novel expressive text-to-speech model with parallelized architecture.
Our novel approach to modeling noise from audio, using domain adversarial training and Residual Decoding, enables style transfer without transferring noise.
arXiv Detail & Related papers (2021-03-17T07:11:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.