CDLM: Consistency Diffusion Language Models For Faster Sampling
- URL: http://arxiv.org/abs/2511.19269v1
- Date: Mon, 24 Nov 2025 16:21:25 GMT
- Title: CDLM: Consistency Diffusion Language Models For Faster Sampling
- Authors: Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami,
- Abstract summary: Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference.<n>We introduce CDLM, a training-based acceleration method that simultaneously tackles both bottlenecks.<n>Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks.
- Score: 54.886467592798
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
Related papers
- Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding [36.74241893088594]
Diffusion Large Language Models (dLLMs) offer a compelling paradigm for natural language generation.<n>Recent works have accelerated inference via KV cache reuse or decoding, but overlook the intrinsic inefficiencies within the block-wise diffusion process.<n>We propose Streaming-dLLM, a training-free framework that streamlines inference across both spatial and temporal dimensions.
arXiv Detail & Related papers (2026-01-25T17:36:04Z) - CD4LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models [27.070045950001532]
CD4LM is a framework that decouples training from inference.<n>On GSM8K, CD4LM matches the LLaDA baseline with a 5.18x wall-clock speedup.
arXiv Detail & Related papers (2026-01-05T16:09:22Z) - dParallel: Learnable Parallel Decoding for dLLMs [77.24184219948337]
Diffusion large language models (dLLMs) offer parallel token prediction and lower inference latency.<n>Existing open-source models still require nearly token-length decoding steps to ensure performance.<n>We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling.
arXiv Detail & Related papers (2025-09-30T16:32:52Z) - Sequential Diffusion Language Models [110.06562906987052]
Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value caches.<n>We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction.<n>We propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost.
arXiv Detail & Related papers (2025-09-28T17:59:15Z) - Diffusion Language Models Know the Answer Before Decoding [56.96815863705218]
Diffusion language models (DLMs) have emerged as an alternative to autoregressive approaches.<n>Our work highlights and leverage an overlooked property of DLMs early answer convergence.<n>We introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding.
arXiv Detail & Related papers (2025-08-27T15:40:25Z) - Accelerating Diffusion LLMs via Adaptive Parallel Decoding [60.407727995313074]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel.<n>APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z) - Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities.<n>We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop.<n>We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.