Related papers: CD4LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models

CD4LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models

URL: http://arxiv.org/abs/2601.02236v1
Date: Mon, 05 Jan 2026 16:09:22 GMT
Title: CD4LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models
Authors: Yihao Liang, Ze Wang, Hao Chen, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Emad Barsoum, Zicheng Liu, Niraj K. Jha,
Abstract summary: CD4LM is a framework that decouples training from inference.<n>On GSM8K, CD4LM matches the LLaDA baseline with a 5.18x wall-clock speedup.
Score: 27.070045950001532
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive large language models achieve strong results on many benchmarks, but decoding remains fundamentally latency-limited by sequential dependence on previously generated tokens. Diffusion language models (DLMs) promise parallel generation but suffer from a fundamental static-to-dynamic misalignment: Training optimizes local transitions under fixed schedules, whereas efficient inference requires adaptive "long-jump" refinements through unseen states. Our goal is to enable highly parallel decoding for DLMs with low number of function evaluations while preserving generation quality. To achieve this, we propose CD4LM, a framework that decouples training from inference via Discrete-Space Consistency Distillation (DSCD) and Confidence-Adaptive Decoding (CAD). Unlike standard objectives, DSCD trains a student to be trajectory-invariant, mapping diverse noisy states directly to the clean distribution. This intrinsic robustness enables CAD to dynamically allocate compute resources based on token confidence, aggressively skipping steps without the quality collapse typical of heuristic acceleration. On GSM8K, CD4LM matches the LLaDA baseline with a 5.18x wall-clock speedup; across code and math benchmarks, it strictly dominates the accuracy-efficiency Pareto frontier, achieving a 3.62x mean speedup while improving average accuracy. Code is available at https://github.com/yihao-liang/CDLM

Related papers

Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes [10.877713536966601]
Longestahead Prefix (LSP) scheduler is a training-free and model-agnostic inference paradigm based on monolithic prefix absorption.<n>LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions.<n>It snaps its boundary to natural linguistic or structural acceptances before an atomic commitment.
arXiv Detail & Related papers (2026-03-05T18:25:26Z)
T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization [45.026481622387244]
Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel.<n>We propose a trajectory self-distillation framework that improves few-step decoding by distilling the model's own generative trajectories.<n>Our approach consistently outperforms strong few-step baselines and standard training under tight step budgets.
arXiv Detail & Related papers (2026-02-12T18:52:35Z)
WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference [44.87788417755154]
We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention.<n>We show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups.
arXiv Detail & Related papers (2025-12-28T01:25:48Z)
CDLM: Consistency Diffusion Language Models For Faster Sampling [54.886467592798]
Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference.<n>We introduce CDLM, a training-based acceleration method that simultaneously tackles both bottlenecks.<n>Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks.
arXiv Detail & Related papers (2025-11-24T16:21:25Z)
Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model [98.35868970993232]
Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm.<n>We introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber) to achieve better inference speed and output quality in code generation.
arXiv Detail & Related papers (2025-10-20T23:38:12Z)
CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credits [37.06886078519443]
CreditDecoding is a training-free parallel decoding algorithm that accelerates the confidence convergence of correct but underconfident tokens.<n>On eight benchmarks, CreditDecoding achieves a 5.48 times speedup and a 0.48 performance improvement over LLaDA-8B-Instruct.
arXiv Detail & Related papers (2025-10-07T17:08:33Z)
Sequential Diffusion Language Models [110.06562906987052]
Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value caches.<n>We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction.<n>We propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost.
arXiv Detail & Related papers (2025-09-28T17:59:15Z)
Diffusion Language Models Know the Answer Before Decoding [56.96815863705218]
Diffusion language models (DLMs) have emerged as an alternative to autoregressive approaches.<n>Our work highlights and leverage an overlooked property of DLMs early answer convergence.<n>We introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding.
arXiv Detail & Related papers (2025-08-27T15:40:25Z)
Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs [57.69190972274813]
Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models.<n>ExistingDLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation.<n>We introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding inDLLMs.
arXiv Detail & Related papers (2025-07-24T16:51:33Z)
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities.<n>We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop.<n>We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.