Related papers: DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding

DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding

URL: http://arxiv.org/abs/2506.22362v1
Date: Fri, 27 Jun 2025 16:23:07 GMT
Title: DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding
Authors: Yang Yang, Yunpeng Li, George Sung, Shao-Fu Shih, Craig Dooley, Alessio Centazzo, Ramanan Rajeswaran,
Abstract summary: DiffSoundStream is a solution that improves the efficiency of speech tokenization in non-streaming scenarios.<n> Experiments show that at 50 tokens per second, DiffSoundStream achieves speech quality on par with a standard SoundStream model.
Score: 12.05169114091718
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Token-based language modeling is a prominent approach for speech generation, where tokens are obtained by quantizing features from self-supervised learning (SSL) models and extracting codes from neural speech codecs, generally referred to as semantic tokens and acoustic tokens. These tokens are often modeled autoregressively, with the inference speed being constrained by the token rate. In this work, we propose DiffSoundStream, a solution that improves the efficiency of speech tokenization in non-streaming scenarios through two techniques: (1) conditioning the neural codec on semantic tokens to minimize redundancy between semantic and acoustic tokens, and (2) leveraging latent diffusion models to synthesize high-quality waveforms from semantic and coarse-level acoustic tokens. Experiments show that at 50 tokens per second, DiffSoundStream achieves speech quality on par with a standard SoundStream model operating at twice the token rate. Additionally, we achieve step-size distillation using just four diffusion sampling steps with only a minor quality loss.

Related papers

SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec [83.61175662066364]
Speech codecs serve as a crucial bridge in unifying speech and text language models.<n>Existing methods face several challenges in semantic encoding.<n>We propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codecs.
arXiv Detail & Related papers (2025-08-04T19:22:14Z)
A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation [65.05719674893999]
We study two different strategies based on token prediction and regression, and introduce a new method based on Schr"odinger Bridge. We examine how different design choices affect machine and human perception.
arXiv Detail & Related papers (2024-10-29T18:29:39Z)
DM-Codec: Distilling Multimodal Representations for Speech Tokenization [11.433520275513803]
DM-Codec is a language model-guided distillation method that incorporates contextual information. It significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset.
arXiv Detail & Related papers (2024-10-19T07:14:14Z)
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding [24.472393096460774]
We propose an enhanced inference method that allows for flexible trade-offs between speed and quality during inference without requiring additional training. Our core idea is to predict multiple tokens per inference step of the AR module using multiple prediction heads. In experiments, we demonstrate that the time required to predict each token is reduced by a factor of 4 to 5 compared to baseline models.
arXiv Detail & Related papers (2024-10-17T17:55:26Z)
SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and SotA inc segmentation and clustering. SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [63.8735398698683]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens.<n>We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain.<n>WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
How Should We Extract Discrete Audio Tokens from Self-Supervised Models? [15.03039528965825]
This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers.
arXiv Detail & Related papers (2024-06-15T20:43:07Z)
Autoregressive Diffusion Transformer for Text-to-Speech Synthesis [39.32761051774537]
We propose encoding audio as vector sequences in continuous space $mathbb Rd$ and autoregressively generating these sequences. High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing.
arXiv Detail & Related papers (2024-06-08T18:57:13Z)
TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture. We show that our model achieves excellent performance in terms of separation, both with or without transcript conditioning. We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio predictor with residual vectorizers to get the quantized latent vectors. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, synthesis, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech. Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network. In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.