From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs
- URL: http://arxiv.org/abs/2512.06776v1
- Date: Sun, 07 Dec 2025 10:28:21 GMT
- Title: From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs
- Authors: Yuchuan Tian, Yuchen Liang, Jiacheng Sun, Shuo Zhang, Guangwen Yang, Yingte Shu, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, Hanting Chen, Xinghao Chen, Yunhe Wang
- Abstract summary: We show that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. NBDiff-7B (Base and Instruct) inherits long-context modeling and reasoning capabilities and achieves state-of-the-art performance.
- Score: 58.640039233470766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) excel at generation, but the dominant autoregressive (AR) decoding is inherently sequential, creating a throughput bottleneck. Diffusion Language Models (DLMs)--especially block-wise variants--enable parallel generation and intra-block bidirectional reasoning, yet training large DLMs from scratch is costly and wastes the knowledge in mature AR checkpoints. Prior "adaptation" attempts either modify logits, randomly grow attention masks to full-sequence diffusion, or simply transplant AR weights into a block-diffusion recipe, leaving a fundamental mismatch between AR causality and block-wise bidirectionality unaddressed. We reframe adaptation as an intra-paradigm path from AR to Block-Diffusion by viewing AR as Block-Diffusion with block size = 1. Concretely, we design the adaptation pathway as follows: we use a context-causal attention mask (causal over the context, bidirectional only within the active block), an efficient parallel adaptation procedure, an auxiliary AR loss to maximize data utilization and retain pretrained knowledge, and a gradual increase of the generation block size. The recipe integrates cleanly with masked block-diffusion and maintains train-inference consistency. Built on these components, NBDiff-7B (Base and Instruct) inherits long-context modeling and reasoning capabilities and achieves state-of-the-art performance among 7B-class DLMs, delivering strong gains on general-knowledge, math, and code benchmarks over strong baselines. These results demonstrate that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. Code: https://github.com/YuchuanTian/NBDiff.
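The context-causal attention pattern described in the abstract can be sketched concretely. The function below is a hypothetical illustration (not taken from the NBDiff codebase): each query token attends causally to all earlier blocks but bidirectionally within its own block, and with block size 1 the mask reduces to the ordinary causal mask, matching the paper's framing of AR as block diffusion with block size 1.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Sketch of a context-causal attention mask: a key position is
    visible to a query iff the key's block index is <= the query's
    block index. True = attention allowed.

    Within a block (equal block indices) attention is fully
    bidirectional; across blocks it is strictly causal. Setting
    block_size=1 recovers the standard lower-triangular AR mask.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        for k in range(seq_len):
            # Compare block indices rather than raw positions.
            mask[q][k] = (k // block_size) <= (q // block_size)
    return mask
```

For example, with `seq_len=6` and `block_size=3`, position 0 can attend to positions 1 and 2 (same block, bidirectional) but not to position 3 (a later block), while every position in the second block sees the entire first block.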
Related papers
- Causal Autoregressive Diffusion Language Model [70.7353007255797]
CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation.
arXiv Detail & Related papers (2026-01-29T17:38:29Z) - Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed [76.49335677120031]
Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation. We study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy.
arXiv Detail & Related papers (2025-12-16T04:12:17Z) - SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation [62.14510717860079]
We propose a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation.
arXiv Detail & Related papers (2025-10-07T17:29:28Z) - AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size [7.442463267121892]
Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding. This paper presents the first systematic investigation challenging the fixed-block-size assumption in semi-AR decoding. We introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting the block size at runtime.
arXiv Detail & Related papers (2025-09-30T15:53:56Z) - Fast-dLLM v2: Efficient Block-Diffusion LLM [64.38006546510337]
Fast-dLLM v2 is a block diffusion language model that adapts pretrained AR models into dLLMs for parallel text generation. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens).
arXiv Detail & Related papers (2025-09-30T14:40:18Z) - Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step [28.12392773921128]
Masked diffusion language models (MDLMs) offer properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. A naive approach is to directly transfer techniques well-established for autoregressive (AR) language models to MDLMs. We propose EOS Early Rejection (EOSER) and Ascending Step-Size (ASS) decoding schedulers, which unlock the potential of MDLMs to perform full diffusion-style decoding.
arXiv Detail & Related papers (2025-09-28T15:01:15Z) - Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding [60.06816407728172]
Discrete diffusion language models have shown strong potential for text generation. Standard supervised fine-tuning misaligns with semi-autoregressive inference. We propose Blockwise SFT, which partitions responses into fixed-size blocks.
arXiv Detail & Related papers (2025-08-27T02:49:33Z) - DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation [11.910667302899638]
DiffusionBlocks is a principled framework for transforming transformer-based networks into genuinely independent trainable blocks. Our experiments on a range of transformer architectures demonstrate that DiffusionBlocks training matches the performance of end-to-end training.
arXiv Detail & Related papers (2025-06-17T05:44:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.