Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models
- URL: http://arxiv.org/abs/2509.23653v1
- Date: Sun, 28 Sep 2025 05:39:49 GMT
- Title: Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models
- Authors: Zemin Huang, Yuhang Wang, Zhiyang Chen, Guo-Jun Qi
- Abstract summary: RemeDi is a mask-based DLM that predicts token distributions and per-token confidence scores at each step. A remask-aware pipeline trains this ability, including supervised fine-tuning that teaches the model to detect and remask incorrect tokens. Experiments show that RemeDi achieves state-of-the-art results among open-source DLMs on multiple datasets.
- Score: 40.902681492117786
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mask-based Diffusion Language Models (DLMs) struggle to revise incorrect tokens: once a token is generated, it typically remains fixed. The key challenge is to identify potential errors in the inputs. In this paper, we propose the Remasking-enabled Diffusion Language Model (RemeDi), a mask-based DLM that introduces remasking as another fundamental mechanism, enabling more flexible text refinement in diffusion-based text generation. To achieve this, RemeDi jointly predicts token distributions and per-token confidence scores at each step. The confidence scores determine which tokens are unmasked after the current step, allowing the model to identify low-quality tokens and remask them. These remasked tokens can be resampled with richer context in subsequent steps. We design a remask-aware pipeline to train this ability, including supervised fine-tuning, which teaches the model to detect and remask incorrect tokens in addition to predicting masked tokens, and reinforcement learning, which optimizes full generation trajectories toward higher rewards. Experiments show that RemeDi achieves state-of-the-art results among open-source DLMs on multiple datasets.
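The remasking step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `MASK` symbol, the `remask_step` function, the hard-coded predictions and confidences, and the fixed `keep` count are all hypothetical stand-ins for what a trained model and its unmasking schedule would provide. The sketch only shows the core idea that the confidence ranking is applied to every position, so a token decoded earlier can fall out of the unmasked set and be remasked.

```python
MASK = "<mask>"  # hypothetical mask symbol, chosen for this sketch


def remask_step(tokens, predictions, confidences, keep):
    """One toy decoding step with confidence-based remasking.

    tokens:      current sequence, with MASK at undecided positions
    predictions: proposed token for every position (stand-in for model output)
    confidences: one confidence score per position (stand-in for model output)
    keep:        how many positions stay unmasked after this step (assumed schedule)
    """
    # Keep an already-decoded token; fill a masked slot with the proposal.
    candidates = [p if t == MASK else t for t, p in zip(tokens, predictions)]
    # Rank all positions by confidence, highest first; ties go to the earlier position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-confidences[i], i))
    unmasked = set(ranked[:keep])
    # Positions outside the top-`keep` set are masked, including tokens
    # that were decoded in an earlier step (this is the remasking).
    return [c if i in unmasked else MASK for i, c in enumerate(candidates)]


# Step 1: the earlier token "5" gets a low confidence score and is remasked.
state = ["2", "+", "2", "=", "5", MASK]
state = remask_step(
    state,
    predictions=["2", "+", "2", "=", "4", "."],
    confidences=[0.9, 0.9, 0.9, 0.9, 0.1, 0.6],
    keep=5,
)
print(state)  # ['2', '+', '2', '=', '<mask>', '.']

# Step 2: the remasked slot is filled again, now with more context available.
state = remask_step(
    state,
    predictions=["2", "+", "2", "=", "4", "."],
    confidences=[0.9, 0.9, 0.9, 0.9, 0.8, 0.7],
    keep=6,
)
print(state)  # ['2', '+', '2', '=', '4', '.']
```

In a real system both the predictions and the confidence scores would come from the model at each step, and the number of positions kept unmasked would follow the sampler's schedule rather than the fixed values assumed here.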
Related papers
- On the Semantic and Syntactic Information Encoded in Proto-Tokens for One-Step Text Reconstruction [0.5097809301149341]
Autoregressive large language models (LLMs) generate text token-by-token, requiring n forward passes to produce a sequence of length n. Recent work shows that frozen LLMs can reconstruct hundreds of tokens from only two learned proto-tokens in a single forward pass. We study what information these proto-tokens encode and how they behave under reconstruction and controlled constraints.
arXiv Detail & Related papers (2026-02-20T15:54:10Z)
- Where-to-Unmask: Ground-Truth-Guided Unmasking Order Learning for Masked Diffusion Language Models [17.18632315520133]
Masked Diffusion Language Models generate text by iteratively filling masked tokens. Gt-Margin yields an oracle unmasking order that prioritizes easier positions first under each partially masked state. We train a supervised unmasking planner via learning-to-rank to imitate the oracle ordering from masked contexts.
arXiv Detail & Related papers (2026-02-10T07:56:46Z)
- Beyond Hard Masks: Progressive Token Evolution for Diffusion Language Models [46.151072011636444]
EvoToken-DLM is a novel diffusion-based language modeling approach that replaces hard binary masks with evolving soft token distributions. EvoToken-DLM consistently achieves superior performance, outperforming strong diffusion-based and masked DLM baselines.
arXiv Detail & Related papers (2026-01-12T09:25:14Z)
- Learning Unmasking Policies for Diffusion Language Models [33.44995119635116]
Diffusion Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model's vocabulary. In this work, we propose to train sampling procedures using reinforcement learning.
arXiv Detail & Related papers (2025-12-09T20:44:33Z)
- Soft-Masked Diffusion Language Models [35.191030145577145]
We introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-$k$ predicted tokens. We demonstrate that continuing pretraining a 169M parameter model with SM leads to improved perplexity and MAUVE scores. We finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM.
arXiv Detail & Related papers (2025-10-20T06:42:03Z)
- Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding [60.06816407728172]
Discrete diffusion language models have shown strong potential for text generation. Standard supervised fine-tuning misaligns with semi-autoregressive inference. We propose Blockwise SFT, which partitions responses into fixed-size blocks.
arXiv Detail & Related papers (2025-08-27T02:49:33Z)
- Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP). We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z)
- Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions [32.48588058887852]
Insertion Language Models (ILMs) learn to insert tokens at arbitrary positions in a sequence. ILMs can represent strong dependencies between tokens, and their ability to generate sequences in arbitrary order allows them to accurately model sequences.
arXiv Detail & Related papers (2025-05-09T03:29:15Z)
- Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text [27.320746607958142]
We propose to adjust the masking ratio and to decide which tokens to mask based on a novel task-informed anti-curriculum learning scheme. We exemplify our novel task-informed anti-curriculum by masking approach across three diverse downstream tasks.
arXiv Detail & Related papers (2025-02-18T15:36:16Z)
- DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak [51.8218217407928]
Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs. This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models.
arXiv Detail & Related papers (2024-12-23T12:44:54Z)
- Tokenize Anything via Prompting [65.93061853439512]
We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. We train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters. We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context.
arXiv Detail & Related papers (2023-12-14T17:01:02Z)
- MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer [158.06850125920923]
Diffusion probabilistic models (DPMs) often lack contextual reasoning ability to learn the relations among object parts in an image. We propose a mask latent modeling scheme to explicitly enhance the DPMs' ability to learn contextual relations among object semantic parts in an image. Experimental results show that MDTv2 achieves superior image synthesis performance, e.g., a new SOTA FID score of 1.58 on the ImageNet dataset, and has more than 10x faster learning speed than the previous SOTA DiT.
arXiv Detail & Related papers (2023-03-25T07:47:21Z)
- AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders [44.87786478095987]
Masked Autoencoders learn general representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data. This paper proposes an adaptive masking strategy for MAEs that is end-to-end trainable. AdaMAE samples visible tokens based on the semantic context using an auxiliary sampling network.
arXiv Detail & Related papers (2022-11-16T18:59:48Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- Unsupervised Text Style Transfer with Padded Masked Language Models [25.397832729384064]
Masker is an unsupervised text-editing method for style transfer. It performs competitively in a fully unsupervised setting. It improves supervised methods' accuracy by over 10 percentage points in low-resource settings.
arXiv Detail & Related papers (2020-10-02T15:33:42Z)
- UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks. Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.