Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding
- URL: http://arxiv.org/abs/2508.19529v1
- Date: Wed, 27 Aug 2025 02:49:33 GMT
- Title: Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding
- Authors: Bowen Sun, Yujun Cai, Ming-Hsuan Yang, Yiwei Wang
- Abstract summary: Discrete diffusion language models have shown strong potential for text generation. Standard supervised fine-tuning misaligns with semi-autoregressive inference. We propose Blockwise SFT, which partitions responses into fixed-size blocks.
- Score: 60.06816407728172
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Discrete diffusion language models have shown strong potential for text generation, yet standard supervised fine-tuning (SFT) misaligns with their semi-autoregressive inference: training randomly masks tokens across the entire response, while inference generates fixed-size blocks sequentially. This mismatch introduces noisy prefixes and leaky suffixes, biasing gradients away from the desired blockwise likelihood. We propose Blockwise SFT, which partitions responses into fixed-size blocks, selects one active block per step for stochastic masking, freezes all preceding tokens, and fully hides future ones. Loss is computed only over the active block, directly mirroring the blockwise decoding process. Experiments on GSM8K, MATH, and MetaMathQA show consistent gains over classical SFT under equal compute or token budgets. Block size consistency studies and ablations confirm that improvements stem from faithful training-inference alignment rather than incidental masking effects. Our results highlight the importance of matching supervision granularity to the decoding procedure in diffusion-based language models.
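As a concrete reading of the abstract's recipe (partition into fixed blocks, pick one active block, stochastically mask inside it, keep the prefix clean, fully hide the suffix, supervise only the active block), here is a minimal illustrative sketch. It is not the authors' code; the function name, arguments, and mask-rate choice are ours:

```python
import random

def blockwise_sft_example(response_ids, block_size, mask_id, seed=0):
    """Illustrative sketch of the Blockwise SFT masking recipe:
    one active block is stochastically masked, the prefix stays
    clean, the suffix is fully hidden, and the loss mask covers
    only the masked positions of the active block."""
    rng = random.Random(seed)
    L = len(response_ids)
    n_blocks = (L + block_size - 1) // block_size
    b = rng.randrange(n_blocks)                      # 1. sample the active block
    start, end = b * block_size, min((b + 1) * block_size, L)
    rate = rng.random()                              # 2. diffusion-style mask rate (placeholder schedule)
    inputs, loss_mask = list(response_ids), [False] * L
    for i in range(start, end):                      # stochastic masking inside the active block
        if rng.random() < rate:
            inputs[i] = mask_id
            loss_mask[i] = True                      # loss only over masked active-block tokens
    for i in range(end, L):                          # 3. fully hide future blocks (no "leaky suffix")
        inputs[i] = mask_id
    return inputs, loss_mask
```

By construction, every token is either left intact (frozen prefix and unmasked active-block positions) or replaced by `mask_id`, and no supervised position carries a visible token, mirroring the blockwise decoding step.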
Related papers
- Swordsman: Entropy-Driven Adaptive Block Partition for Efficient Diffusion Language Models [40.39823804602205]
Swordsman is an entropy-driven adaptive block-wise decoding framework for diffusion language models. It partitions blocks by identifying entropy shifts between adjacent tokens to better align with semantic or syntactic constituent boundaries. As a training-free framework, Swordsman demonstrates state-of-the-art performance across extensive evaluations.
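The entropy-shift idea above can be illustrated with a toy partitioner. The threshold, minimum block length, and function names below are placeholders of ours, not Swordsman's actual design:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_shift_partition(token_entropies, threshold=0.5, min_block=2):
    """Illustrative sketch of entropy-driven block partition: open a
    new block wherever the entropy of adjacent tokens shifts by more
    than `threshold`, subject to a minimum block length. Both
    hyperparameter values are placeholders."""
    boundaries = [0]
    for i in range(1, len(token_entropies)):
        shift = abs(token_entropies[i] - token_entropies[i - 1])
        if shift > threshold and i - boundaries[-1] >= min_block:
            boundaries.append(i)
    # return blocks as (start, end) spans covering the whole sequence
    return list(zip(boundaries, boundaries[1:] + [len(token_entropies)]))
```

A sharp rise or fall in per-token entropy is treated as a proxy for a constituent boundary, so block edges land between, rather than inside, coherent spans.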
arXiv Detail & Related papers (2026-02-04T10:27:49Z)
- Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models [58.946955321428845]
This work presents self-rewarding sequential Monte Carlo (SMC). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy. We introduce trajectory-level confidence as a self-rewarding signal for assigning particle importance weights.
arXiv Detail & Related papers (2026-02-02T09:21:45Z)
- From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs [58.640039233470766]
We show that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. NBDiff-7B (Base and Instruct) inherits long-context modeling and reasoning capabilities and achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-12-07T10:28:21Z)
- Soft-Masked Diffusion Language Models [35.191030145577145]
We introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-$k$ predicted tokens. We demonstrate that continued pretraining of a 169M-parameter model with SM leads to improved perplexity and MAUVE scores. We finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM.
arXiv Detail & Related papers (2025-10-20T06:42:03Z)
- AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size [7.442463267121892]
Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding. This paper presents the first systematic investigation challenging the fixed block size assumption in semi-AR decoding. We introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size at runtime.
arXiv Detail & Related papers (2025-09-30T15:53:56Z)
- Sequential Diffusion Language Models [110.06562906987052]
Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value caches. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction. We propose the Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost.
arXiv Detail & Related papers (2025-09-28T17:59:15Z)
- Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models [40.902681492117786]
RemeDi is a mask-based DLM that predicts token distributions and per-token confidence scores at each step. We design a remask-aware pipeline to train this ability, including supervised fine-tuning that teaches the model to detect and remask incorrect tokens. Experiments show that RemeDi achieves state-of-the-art results among open-source DLMs on multiple datasets.
arXiv Detail & Related papers (2025-09-28T05:39:49Z)
- Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models [13.575063025878208]
Masked diffusion language models promise fast, non-autoregressive text generation. Existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel.
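The dilated grouping behind DUS can be sketched directly (an illustrative sketch; `dilation` and the function name are our placeholders, not the paper's API):

```python
def dilated_unmask_order(num_positions, dilation):
    """Illustrative sketch of dilated grouping: partition sequence
    positions into `dilation` groups of non-adjacent indices, so that
    each parallel unmasking step never touches two neighboring
    positions. Group g holds {g, g + dilation, g + 2*dilation, ...}."""
    return [list(range(g, num_positions, dilation)) for g in range(dilation)]
```

With `dilation >= 2`, positions within one group are at least `dilation` apart, which limits the token-interaction conflicts that confidence-only samplers ignore when unmasking neighbors simultaneously.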
arXiv Detail & Related papers (2025-06-23T18:49:23Z)
- DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak [51.8218217407928]
Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs. This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models.
arXiv Detail & Related papers (2024-12-23T12:44:54Z)
- MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer [9.100416536151869]
Masked Generative Codec Transformer (MaskGCT) is a fully non-autoregressive text-to-speech model.
MaskGCT eliminates the need for explicit alignment information between text and speech supervision, as well as phone-level duration prediction.
Experiments with 100K hours of in-the-wild speech demonstrate that MaskGCT outperforms the current state-of-the-art zero-shot TTS systems.
arXiv Detail & Related papers (2024-09-01T15:26:30Z)
- Exploring and Improving Drafts in Blockwise Parallel Decoding [37.295672367973886]
Blockwise parallel decoding (BPD) was proposed by Stern et al. as a method to improve inference speed of language models.
This paper contributes to the understanding and improvement of block drafts in two ways.
Experiments demonstrate that refined block drafts yield a +5-21% increase in block efficiency.
arXiv Detail & Related papers (2024-04-14T11:49:38Z)
- BASS: Block-wise Adaptation for Speech Summarization [47.518484305407185]
We develop a method that allows one to train summarization models on very long sequences in an incremental manner.
Speech summarization is realized as a streaming process, where hypothesis summaries are updated every block.
Experiments on the How2 dataset demonstrate that the proposed block-wise training method improves by 3 points absolute on ROUGE-L over a truncated input baseline.
arXiv Detail & Related papers (2023-07-17T03:31:36Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.