VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction
- URL: http://arxiv.org/abs/2511.10232v1
- Date: Fri, 14 Nov 2025 01:40:39 GMT
- Title: VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction
- Authors: Yuhao Wang, Ziyang Cheng, Heyang Liu, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang,
- Abstract summary: VocalNet-M2 is a novel low-latency SLM that integrates a multi-codebook tokenizer and a multi-token prediction strategy.<n>Our model directly generates multi-codebook speech tokens, thus eliminating the need for a latency-inducing flow-matching model.
- Score: 31.58493743596625
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current end-to-end spoken language models (SLMs) have made notable progress, yet they still encounter considerable response latency. This delay primarily arises from the autoregressive generation of speech tokens and the reliance on complex flow-matching models for speech synthesis. To overcome this, we introduce VocalNet-M2, a novel low-latency SLM that integrates a multi-codebook tokenizer and a multi-token prediction (MTP) strategy. Our model directly generates multi-codebook speech tokens, thus eliminating the need for a latency-inducing flow-matching model. Furthermore, our MTP strategy enhances generation efficiency and improves overall performance. Extensive experiments demonstrate that VocalNet-M2 achieves a substantial reduction in first chunk latency (from approximately 725ms to 350ms) while maintaining competitive performance across mainstream SLMs. This work also provides a comprehensive comparison of single-codebook and multi-codebook strategies, offering valuable insights for developing efficient and high-performance SLMs for real-time interactive applications.
Related papers
- MaDiS: Taming Masked Diffusion Language Models for Sign Language Generation [78.75809158246723]
We present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional and supports efficient parallel multi-token generation.<n>We also introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-Hearing, and 3D-space objectives.<n>MaDiS achieves superior performance across multiple metrics, including DTW error and two newly introduced metrics, SiBLEU and SiCLIP, while reducing inference latency by nearly 30%.
arXiv Detail & Related papers (2026-01-27T13:06:47Z) - Efficient Multi-modal Long Context Learning for Training-free Adaptation [96.21248144937627]
This paper introduces Efficient Multi-Modal Long Context Learning (EMLoC)<n>It embeds demonstration examples directly into the model input.<n>It condenses long-context multimodal inputs into compact, task-specific memory representations.
arXiv Detail & Related papers (2025-05-26T10:49:44Z) - L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models [95.53699156138435]
We propose leap multi-token prediction(L-MTP), an innovative token prediction method.<n>Unlike conventional MTP, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass.<n>We theoretically demonstrate the benefit of L-MTP in improving inference efficiency.
arXiv Detail & Related papers (2025-05-23T05:59:46Z) - FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks [41.04727840852988]
Large language and multimodal models (LLMs and LMMs) exhibit strong inference capabilities but are often limited by slow decoding speeds.<n>This challenge is especially acute in LMMs, where visual inputs typically comprise more tokens with lower information density than text.<n>We propose textbfFLASH (Fast Latent-Aware Semi-Autoregressive Heuristics), a speculative decoding framework designed specifically for LMMs.
arXiv Detail & Related papers (2025-05-19T05:35:30Z) - VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation [26.34810950257782]
Speech large language models (LLMs) have emerged as a prominent research focus in speech processing.<n>We introduce VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework.<n>Central to our contribution is the first application of multi-token prediction (MTP) to speech LLMs.
arXiv Detail & Related papers (2025-04-05T04:57:12Z) - CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models [74.80386066714229]
We present an improved streaming speech synthesis model, CosyVoice 2.<n>Specifically, we introduce finite-scalar quantization to improve codebook utilization of speech tokens.<n>We develop a chunk-aware causal flow matching model to support various synthesis scenarios.
arXiv Detail & Related papers (2024-12-13T12:59:39Z) - Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video)<n>We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z) - PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition [15.204703947024242]
PaDeLLM-NER allows for the simultaneous decoding of all mentions, thereby reducing generation latency.
Experiments reveal that PaDeLLM-NER significantly increases inference speed that is 1.76 to 10.22 times faster than the autoregressive approach for both English and Chinese.
arXiv Detail & Related papers (2024-02-07T13:39:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.