Beyond Hard Masks: Progressive Token Evolution for Diffusion Language Models
- URL: http://arxiv.org/abs/2601.07351v2
- Date: Fri, 16 Jan 2026 06:24:27 GMT
- Title: Beyond Hard Masks: Progressive Token Evolution for Diffusion Language Models
- Authors: Linhao Zhong, Linyu Wu, Bozhen Fang, Tianjian Feng, Chenchen Jing, Wen Wang, Jiaheng Zhang, Hao Chen, Chunhua Shen
- Abstract summary: EvoToken-DLM is a novel diffusion-based language modeling approach that replaces hard binary masks with evolving soft token distributions. EvoToken-DLM consistently achieves superior performance, outperforming strong diffusion-based and masked DLM baselines.
- Score: 46.151072011636444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Language Models (DLMs) offer a promising alternative for language modeling by enabling parallel decoding through iterative refinement. However, most DLMs rely on hard binary masking and discrete token assignments, which hinder the revision of early decisions and underutilize intermediate probabilistic representations. In this paper, we propose EvoToken-DLM, a novel diffusion-based language modeling approach that replaces hard binary masks with evolving soft token distributions. EvoToken-DLM enables a progressive transition from masked states to discrete outputs, supporting revisable decoding. To effectively support this evolution, we introduce continuous trajectory supervision, which aligns training objectives with iterative probabilistic updates. Extensive experiments across multiple benchmarks show that EvoToken-DLM consistently achieves superior performance, outperforming strong diffusion-based and masked DLM baselines. Project webpage: https://aim-uofa.github.io/EvoTokenDLM.
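The abstract sketches a decoding scheme in which per-position token distributions evolve from fully soft, mask-like states toward discrete outputs. Below is a minimal, hedged sketch of that idea in PyTorch; the generic `model` callable (mapping per-position token distributions to logits), the linear sharpening schedule, and all names are illustrative assumptions rather than the authors' implementation (see the project webpage for the actual code).

```python
import torch

def evolve_soft_tokens(model, prompt_probs, gen_len, vocab_size, num_steps=16):
    # prompt_probs: (batch, prompt_len, vocab) one-hot distributions for the prompt.
    # `model` is assumed to map (batch, seq_len, vocab) token distributions to
    # logits of the same shape; this interface is an assumption for illustration.
    batch = prompt_probs.size(0)
    # Generated positions start as uniform (maximally soft) distributions
    # rather than a hard [MASK] token.
    soft = torch.full((batch, gen_len, vocab_size), 1.0 / vocab_size,
                      device=prompt_probs.device)
    for step in range(num_steps):
        alpha = (step + 1) / num_steps            # trust predictions more over time
        inputs = torch.cat([prompt_probs, soft], dim=1)
        logits = model(inputs)
        pred = logits[:, -gen_len:].softmax(dim=-1)
        # Blend the previous soft state with the new prediction, so earlier
        # decisions remain revisable instead of being frozen by an argmax.
        soft = (1.0 - alpha) * soft + alpha * pred
    return soft.argmax(dim=-1)                    # discretize only at the end
```

The paper's continuous trajectory supervision aligns training with such intermediate probabilistic updates; the fixed linear blending schedule above is only a stand-in, not the schedule the method actually uses.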
Related papers
- Balancing Understanding and Generation in Discrete Diffusion Models [58.62235340638143]
Masked Diffusion Language Models (MDLM) excel at semantic understanding and zero-shot generalization. Uniform-noise Diffusion Language Models (UDLM) achieve strong few-step generation quality. We propose XDLM, which bridges the two paradigms via a stationary noise kernel.
arXiv Detail & Related papers (2026-02-01T18:00:35Z)
- Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow [30.201913054064363]
Masked Diffusion Language Models promise parallel token generation and arbitrary-order decoding. We characterize MDLM behavior along two dimensions: parallelism strength and generation order. We evaluate eight mainstream MDLMs on 58 benchmarks spanning knowledge, reasoning, and programming.
arXiv Detail & Related papers (2026-01-22T02:39:36Z)
- Diffusion Language Models are Provably Optimal Parallel Samplers [15.981424915336001]
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive models. We show that DLMs augmented with a chain-of-thought can simulate any parallel sampling algorithm using an optimal number of sequential steps.
arXiv Detail & Related papers (2025-12-31T18:03:05Z)
- D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs [22.78575203353886]
Diffusion-based multimodal large language models (Diffusion MLLMs) exhibit substantially slower inference than autoregressive models. We propose D$^{3}$ToM, a Decider-guided dynamic token merging method to accelerate inference in Diffusion MLLMs. Experiments show that D$^{3}$ToM accelerates inference while preserving competitive performance.
arXiv Detail & Related papers (2025-11-15T16:24:12Z)
- Sequential Diffusion Language Models [110.06562906987052]
Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value caches. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction. We propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost.
arXiv Detail & Related papers (2025-09-28T17:59:15Z)
- Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models [40.902681492117786]
RemeDi is a mask-based DLM that predicts token distributions and per-token confidence scores at each step. We use a remask-aware pipeline to train this ability, including supervised fine-tuning that teaches the model to detect and remask incorrect tokens. Experiments show that RemeDi achieves state-of-the-art results among open-source DLMs on multiple datasets. A hedged sketch of such a remasking step is given after this list.
arXiv Detail & Related papers (2025-09-28T05:39:49Z)
- A Survey on Diffusion Language Models [30.00199970146068]
Diffusion Language Models (DLMs) are an alternative to the dominant autoregressive (AR) paradigm. DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context. Recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts.
arXiv Detail & Related papers (2025-08-14T17:47:22Z)
- Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens [66.02261367232256]
Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. Existing approaches rely on spatial visual tokens, where image patches are encoded and arranged according to a spatial order. In this paper, we build a proper visual language by reconstructing diffusion timesteps to learn discrete visual tokens.
arXiv Detail & Related papers (2025-04-20T16:14:28Z)
- Large Language Diffusion Models [93.26422905620008]
Large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We introduce LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning paradigm. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines.
arXiv Detail & Related papers (2025-02-14T08:23:51Z)
- Multimodal Latent Language Modeling with Next-Token Diffusion [111.93906046452125]
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). We propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
arXiv Detail & Related papers (2024-12-11T18:57:32Z)
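As referenced in the RemeDi entry above, the following is a minimal sketch of what a confidence-driven remasking step could look like. The function name, the fixed `remask_ratio`, and the rule of remasking the lowest-confidence positions are assumptions for illustration, not the procedure described in that paper.

```python
import torch

def remask_step(logits, confidence, mask_id, remask_ratio=0.2):
    # logits: (batch, seq_len, vocab) current per-position token predictions
    # confidence: (batch, seq_len) per-token confidence scores
    seq_len = logits.size(1)
    num_remask = max(1, int(remask_ratio * seq_len))
    # Indices of the least-confident positions in each sequence.
    low_conf = confidence.topk(num_remask, dim=-1, largest=False).indices
    ids = logits.argmax(dim=-1)            # tentative hard assignments
    ids.scatter_(1, low_conf, mask_id)     # reset uncertain positions to [MASK]
    return ids
```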
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.