Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
- URL: http://arxiv.org/abs/2410.13839v1
- Date: Thu, 17 Oct 2024 17:55:26 GMT
- Title: Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
- Authors: Tan Dat Nguyen, Ji-Hoon Kim, Jeongsoo Choi, Shukjae Choi, Jinseok Park, Younglo Lee, Joon Son Chung,
- Abstract summary: We propose an enhanced inference method that allows for flexible trade-offs between speed and quality during inference without requiring additional training.
Our core idea is to predict multiple tokens per inference step of the AR module using multiple prediction heads.
In experiments, we demonstrate that the time required to predict each token is reduced by a factor of 4 to 5 compared to baseline models.
- Score: 24.472393096460774
- License:
- Abstract: The goal of this paper is to accelerate codec-based speech synthesis systems with minimum sacrifice to speech quality. We propose an enhanced inference method that allows for flexible trade-offs between speed and quality during inference without requiring additional training. Our core idea is to predict multiple tokens per inference step of the AR module using multiple prediction heads, resulting in a linear reduction in synthesis time as the number of heads increases. Furthermore, we introduce a novel speculative decoding technique that utilises a Viterbi-based algorithm to select the optimal sequence of generated tokens at each decoding step. In our experiments, we demonstrate that the time required to predict each token is reduced by a factor of 4 to 5 compared to baseline models, with minimal quality trade-off or even improvement in terms of speech intelligibility. Audio samples are available at: multpletokensprediction.github.io/multipletokensprediction.github.io/.
Related papers
- A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [65.05719674893999]
We study two different strategies based on token prediction and regression, and introduce a new method based on Schr"odinger Bridge.
We examine how different design choices affect machine and human perception.
arXiv Detail & Related papers (2024-10-29T18:29:39Z) - VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z) - Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z) - Latent-Domain Predictive Neural Speech Coding [22.65761249591267]
This paper introduces latent-domain predictive coding into the VQ-VAE framework.
We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner.
Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than at 9 kbps.
arXiv Detail & Related papers (2022-07-18T03:18:08Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired
Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Fast End-to-End Speech Recognition via a Non-Autoregressive Model and
Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once)
The model consists of an encoder, a decoder, and a position dependent summarizer (PDS)
arXiv Detail & Related papers (2021-02-15T15:18:59Z) - End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.