VocalNet-MDM: Accelerating Streaming Speech LLM via Self-Distilled Masked Diffusion Modeling
- URL: http://arxiv.org/abs/2602.08607v1
- Date: Mon, 09 Feb 2026 12:52:59 GMT
- Title: VocalNet-MDM: Accelerating Streaming Speech LLM via Self-Distilled Masked Diffusion Modeling
- Authors: Ziyang Cheng, Yuhao Wang, Heyang Liu, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang,
- Abstract summary: Masked Diffusion Modeling (MDM) is a non-autoregressive paradigm for speech LLMs. VocalNet-MDM is trained on a limited scale of only 6K hours of speech data. It maintains competitive recognition accuracy while achieving state-of-the-art text quality and speech naturalness.
- Score: 31.58493743596625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Speech Large Language Models (LLMs) have achieved impressive capabilities in end-to-end speech interaction. However, the prevailing autoregressive paradigm imposes strict serial constraints, limiting generation efficiency and introducing exposure bias. In this paper, we investigate Masked Diffusion Modeling (MDM) as a non-autoregressive paradigm for speech LLMs and introduce VocalNet-MDM. To adapt MDM for streaming speech interaction, we address two critical challenges: training-inference mismatch and iterative overhead. We propose Hierarchical Block-wise Masking to align training objectives with the progressive masked states encountered during block diffusion decoding, and Iterative Self-Distillation to compress multi-step refinement into fewer steps for low-latency inference. Trained on a limited scale of only 6K hours of speech data, VocalNet-MDM achieves a 3.7×–10× decoding speedup and reduces first-chunk latency by 34% compared to AR baselines. It maintains competitive recognition accuracy while achieving state-of-the-art text quality and speech naturalness, demonstrating that MDM is a promising and scalable alternative for low-latency, efficient speech LLMs.
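To make the decoding scheme concrete, below is a minimal sketch of block-wise masked diffusion decoding with a small, fixed refinement budget. The `model` stub, the confidence-based unmasking rule, and the per-pass commit count are illustrative assumptions, not VocalNet-MDM's exact procedure; Iterative Self-Distillation, as described in the abstract, would amount to training the model so that a small `num_steps` (e.g., 2) reproduces the outputs of a much larger step budget.

```python
# Block-wise masked diffusion decoding (illustrative sketch, not the paper's exact algorithm).
import random

MASK = -1  # sentinel id for a masked position

def decode_block(model, prefix, block_len, num_steps):
    """Fill one block of speech tokens in `num_steps` parallel refinement passes."""
    block = [MASK] * block_len
    per_step = (block_len + num_steps - 1) // num_steps   # positions committed per pass
    for _ in range(num_steps):
        masked = [i for i, t in enumerate(block) if t == MASK]
        if not masked:
            break
        preds, confs = model(prefix, block)                # parallel prediction over the whole block
        # Commit the most confident masked positions; the rest stay masked
        # and are refined in later passes (fewer passes = lower latency).
        for i in sorted(masked, key=lambda i: confs[i], reverse=True)[:per_step]:
            block[i] = preds[i]
    return block

def decode_stream(model, n_blocks, block_len=8, num_steps=2):
    """Blocks are emitted left-to-right, so audio can stream while later blocks decode."""
    out = []
    for _ in range(n_blocks):
        out.extend(decode_block(model, out, block_len, num_steps))
    return out

def dummy_model(prefix, block):                            # stand-in for the speech LLM
    return ([random.randrange(100) for _ in block],
            [random.random() for _ in block])

print(decode_stream(dummy_model, n_blocks=2))
```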
Related papers
- Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding [36.74241893088594]
Diffusion Large Language Models (dLLMs) offer a compelling paradigm for natural language generation. Recent works have accelerated inference via KV cache reuse or decoding, but overlook the intrinsic inefficiencies within the block-wise diffusion process. We propose Streaming-dLLM, a training-free framework that streamlines inference across both spatial and temporal dimensions.
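A minimal sketch of the suffix-pruning idea described above, assuming a sentinel mask id and a fixed look-ahead window (both illustrative, not Streaming-dLLM's actual settings): rather than processing the entire masked suffix at every denoising step, only a short window of upcoming masked positions is kept.

```python
# Suffix pruning for block-wise diffusion decoding (illustrative sketch).
MASK = -1

def prune_suffix(tokens, window):
    """Keep all committed tokens, but at most `window` trailing masked positions."""
    first_mask = next((i for i, t in enumerate(tokens) if t == MASK), len(tokens))
    return tokens[: first_mask + window]

seq = [5, 9, 2] + [MASK] * 64             # 3 committed tokens, 64 masked suffix slots
print(len(prune_suffix(seq, window=8)))   # -> 11: only 8 masked slots are processed this step
```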
arXiv Detail & Related papers (2026-01-25T17:36:04Z) - AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation [60.02195766025208]
We present AR-Omni, a unified any-to-any model in the autoregressive paradigm without any expert decoders. AR-Omni supports autoregressive text and image generation, as well as streaming speech generation, all under a single Transformer decoder. We address three practical issues in unified AR modeling: modality imbalance via task-aware loss reweighting, visual fidelity via a lightweight token-level perceptual alignment loss for image tokens, and stability-creativity trade-offs via a finite-state decoding mechanism.
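A minimal sketch of task-aware loss reweighting as summarized above; the task names and weights are placeholder assumptions, not AR-Omni's actual values.

```python
# Task-aware loss reweighting for a unified any-to-any model (illustrative sketch).
import torch
import torch.nn.functional as F

TASK_WEIGHTS = {"text": 1.0, "image": 0.5, "speech": 2.0}   # assumed weights against modality imbalance

def reweighted_loss(logits, targets, task):
    # Standard next-token cross-entropy, scaled per task so that
    # under-represented or harder modalities are not drowned out.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return TASK_WEIGHTS[task] * ce

logits = torch.randn(4, 10, 32000)        # (batch, seq, vocab)
targets = torch.randint(0, 32000, (4, 10))
print(reweighted_loss(logits, targets, task="speech"))
```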
arXiv Detail & Related papers (2026-01-25T09:17:36Z) - Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models [63.50827603618498]
We propose Sparse-LaViDa, a modeling framework that truncates unnecessary masked tokens at each inference step to accelerate MDM sampling. Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks.
arXiv Detail & Related papers (2025-12-16T02:06:06Z) - MDiff4STR: Mask Diffusion Model for Scene Text Recognition [59.79818820650126]
Mask Diffusion Models (MDMs) have emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks. We show that vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. We propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for Scene Text Recognition.
arXiv Detail & Related papers (2025-12-01T08:57:51Z) - VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction [31.58493743596625]
VocalNet-M2 is a novel low-latency SLM that integrates a multi-codebook tokenizer and a multi-token prediction strategy. Our model directly generates multi-codebook speech tokens, thus eliminating the need for a latency-inducing flow-matching model.
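A minimal sketch of multi-token prediction over multi-codebook speech tokens, assuming one lightweight linear head per (future offset, codebook) pair; the module layout and sizes are illustrative assumptions, not VocalNet-M2's actual architecture.

```python
# Multi-token prediction heads over multi-codebook tokens (illustrative sketch).
import torch
import torch.nn as nn

class MTPHeads(nn.Module):
    def __init__(self, hidden=1024, vocab=4096, n_future=3, n_codebooks=2):
        super().__init__()
        # One linear head per (future offset, codebook) pair.
        self.heads = nn.ModuleList(
            nn.Linear(hidden, vocab) for _ in range(n_future * n_codebooks)
        )
        self.n_future, self.n_codebooks = n_future, n_codebooks

    def forward(self, h):                      # h: (batch, hidden), last decoder state
        logits = [head(h) for head in self.heads]
        # -> (batch, n_future, n_codebooks, vocab): several speech tokens per decoder step
        return torch.stack(logits, dim=1).view(
            h.size(0), self.n_future, self.n_codebooks, -1
        )

heads = MTPHeads()
print(heads(torch.randn(2, 1024)).shape)       # torch.Size([2, 3, 2, 4096])
```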
arXiv Detail & Related papers (2025-11-13T12:06:05Z) - Sequential Diffusion Language Models [110.06562906987052]
Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value caches. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction. We propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost.
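One plausible reading of next-sequence prediction with dynamic length, sketched below: propose a block of tokens in parallel and commit only the longest confident prefix, so a step degrades gracefully to ordinary next-token prediction. The acceptance rule and threshold are assumptions, not SDLM's exact criterion.

```python
# Dynamic-length block commitment (illustrative sketch).
def commit_prefix(proposed, confs, threshold=0.9):
    """Accept the longest prefix of the proposed block whose confidence stays high."""
    accepted = []
    for tok, c in zip(proposed, confs):
        if c < threshold:
            break
        accepted.append(tok)
    # Always make progress: fall back to a single next token if nothing passed.
    return accepted if accepted else proposed[:1]

print(commit_prefix([11, 42, 7, 99], [0.97, 0.93, 0.71, 0.88]))  # -> [11, 42]
```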
arXiv Detail & Related papers (2025-09-28T17:59:15Z) - Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models [13.575063025878208]
Masked diffusion language models promise fast, non-autoregressive text generation. Existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel.
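A minimal sketch of the dilated grouping itself (the group count and ordering are assumptions): positions unmasked together in one step are never adjacent, which limits the parallel-unmasking interactions mentioned above.

```python
# Dilated unmasking schedule (illustrative sketch).
def dilated_schedule(seq_len, n_groups):
    # Group g contains positions g, g + n_groups, g + 2*n_groups, ...
    return [list(range(g, seq_len, n_groups)) for g in range(n_groups)]

for step, positions in enumerate(dilated_schedule(seq_len=12, n_groups=3)):
    print(f"step {step}: unmask positions {positions}")
# step 0: unmask positions [0, 3, 6, 9]
# step 1: unmask positions [1, 4, 7, 10]
# step 2: unmask positions [2, 5, 8, 11]
```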
arXiv Detail & Related papers (2025-06-23T18:49:23Z) - Esoteric Language Models [31.619674001793875]
We introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms. Eso-LMs set a new state of the art on standard language modeling benchmarks. We are the first to introduce KV caching for MDMs while preserving parallel generation.
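A minimal sketch of why KV caching becomes possible for MDMs once tokens are committed: an unmasked position's key/value activations never change across refinement steps, so they can be computed once and reused, while only the still-masked positions need fresh queries. All names here are placeholders, not Eso-LMs' actual mechanism.

```python
# KV caching across diffusion refinement steps (illustrative sketch).
MASK = -1
kv_cache = {}                                     # position -> cached (key, value) tensors

def refine_step(compute_kv, tokens):
    """One refinement pass over a partially committed sequence."""
    for pos, tok in enumerate(tokens):
        if tok != MASK and pos not in kv_cache:
            kv_cache[pos] = compute_kv(tok, pos)  # committed token: compute once, reuse forever
    masked = [p for p, t in enumerate(tokens) if t == MASK]
    # Only masked positions need fresh queries; they attend over kv_cache and each other.
    return masked

print(refine_step(lambda tok, pos: ("k", "v"), [7, 3, MASK, MASK]))  # -> [2, 3]
```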
arXiv Detail & Related papers (2025-06-02T17:47:27Z) - Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding [53.82301522384719]
We propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. Dimple-7B surpasses LLaVA- in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models.
arXiv Detail & Related papers (2025-05-22T17:55:04Z) - Unified Auto-Encoding with Masked Diffusion [15.264296748357157]
We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD).
UMD combines patch-based and noise-based corruption techniques within a single auto-encoding framework.
It achieves strong performance in downstream generative and representation learning tasks.
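A minimal sketch of a corruption step that combines patch masking with additive Gaussian noise in one auto-encoding objective, in the spirit of the summary above; the mask ratio and noise scale are assumptions, not UMD's actual configuration.

```python
# Combined mask + noise corruption for unified auto-encoding (illustrative sketch).
import torch

def corrupt(patches, mask_ratio=0.5, noise_std=0.3):
    # patches: (num_patches, dim)
    n = patches.size(0)
    mask = torch.rand(n) < mask_ratio             # True = patch is masked out
    noisy = patches + noise_std * torch.randn_like(patches)  # diffusion-style noise
    noisy[mask] = 0.0                             # masked patches carry no signal
    return noisy, mask                            # the decoder reconstructs the originals

x = torch.randn(16, 64)
x_corrupt, mask = corrupt(x)
print(x_corrupt.shape, int(mask.sum()), "patches masked")
```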
arXiv Detail & Related papers (2024-06-25T16:24:34Z) - Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
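A minimal sketch of block-wise streaming NAR recognition, assuming a per-frame classifier and a greedy CTC-style collapse; the block size, context handling, and blank id are illustrative, not the paper's exact system.

```python
# Block-wise streaming non-autoregressive ASR (illustrative sketch).
BLANK = 0

def ctc_collapse(ids):
    """Greedy CTC collapse: drop blanks and merge repeated labels."""
    out, prev = [], None
    for i in ids:
        if i != BLANK and i != prev:
            out.append(i)
        prev = i
    return out

def stream_asr(encoder, feature_blocks, context=4):
    """Decode each incoming block in parallel, reusing a short left-context window."""
    hypothesis, prev_block = [], []
    for block in feature_blocks:
        ctx = prev_block[-context:]
        frame_ids = encoder(ctx + block)                       # one label per frame, no AR loop
        hypothesis.extend(ctc_collapse(frame_ids[len(ctx):]))  # keep this block's frames only
        prev_block = block
    return hypothesis

# Toy stand-in encoder: each "frame" is already a label id here.
print(stream_asr(lambda frames: frames, [[0, 5, 5, 0], [0, 8, 0, 8]]))  # -> [5, 8, 8]
```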
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.