Soft-Masked Diffusion Language Models
- URL: http://arxiv.org/abs/2510.17206v1
- Date: Mon, 20 Oct 2025 06:42:03 GMT
- Title: Soft-Masked Diffusion Language Models
- Authors: Michael Hersche, Samuel Moor-Smith, Thomas Hofmann, Abbas Rahimi
- Abstract summary: We introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-$k$ predicted tokens. We demonstrate that continuing pretraining a 169M parameter model with SM leads to improved perplexity and MAUVE scores. We finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM.
- Score: 35.191030145577145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-$k$ predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that adapts a pretrained masked diffusion language model to incorporate SM. We demonstrate that continuing pretraining a 169M parameter model with SM leads to improved perplexity and MAUVE scores. Furthermore, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings.
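The blending step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `soft_mask_embedding`, the fixed mixing weight `mix`, and the renormalization of the top-$k$ probabilities are assumptions made for illustration, since the abstract does not specify how the blend weights are computed.

```python
import numpy as np

def soft_mask_embedding(mask_emb, token_embs, probs, k=3, mix=0.5):
    """Blend the mask embedding with the top-k predicted token embeddings.

    mask_emb:   (d,)   embedding of the mask token
    token_embs: (V, d) embedding table for the vocabulary
    probs:      (V,)   predicted distribution from the previous decoding step
    k:          number of top predictions to blend in
    mix:        weight in [0, 1] placed on the predictions (hypothetical;
                the abstract only says the blend is "dynamic")
    """
    # Indices of the k most probable tokens, most probable first.
    top = np.argsort(probs)[-k:][::-1]
    # Renormalize the top-k probabilities so they sum to 1 (assumption).
    weights = probs[top] / probs[top].sum()
    # Probability-weighted average of the top-k token embeddings.
    predicted = weights @ token_embs[top]
    # Convex combination: mix=0 gives the plain mask embedding
    # (the usual binary "retain the mask" case).
    return (1.0 - mix) * mask_emb + mix * predicted

# Toy usage: 4-token vocabulary with one-hot embeddings.
E = np.eye(4)
mask = np.zeros(4)
p = np.array([0.1, 0.6, 0.2, 0.1])
blended = soft_mask_embedding(mask, E, p, k=2, mix=0.5)
```

With `mix=0` the retained mask carries no information from the previous step, which corresponds to the binary masking baseline; any `mix > 0` lets part of the earlier prediction reach the next decoding step.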
Related papers
- MaDiS: Taming Masked Diffusion Language Models for Sign Language Generation [78.75809158246723]
We present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional and supports efficient parallel multi-token generation. We also introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-Hearing, and 3D-space objectives. MaDiS achieves superior performance across multiple metrics, including DTW error and two newly introduced metrics, SiBLEU and SiCLIP, while reducing inference latency by nearly 30%.
arXiv Detail & Related papers (2026-01-27T13:06:47Z) - Learning Unmasking Policies for Diffusion Language Models [33.44995119635116]
Diffusion language models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model's vocabulary. In this work, we propose to train sampling procedures using reinforcement learning.
arXiv Detail & Related papers (2025-12-09T20:44:33Z) - Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models [40.902681492117786]
RemeDi is a mask-based DLM that predicts token distributions and per-token confidence scores at each step. We use a remask-aware pipeline to train this ability, including supervised fine-tuning, which teaches the model to detect and remask incorrect tokens. Experiments show that RemeDi achieves state-of-the-art results among open-source DLMs on multiple datasets.
arXiv Detail & Related papers (2025-09-28T05:39:49Z) - Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking [28.55159825491572]
Masked diffusion models (MDM) are powerful generative models for discrete data that generate samples by progressively unmasking tokens in a sequence. We propose the Partial masking scheme (Prime), which augments MDM by allowing tokens to take intermediate states between the masked and unmasked states. Our method demonstrates superior performance across a diverse set of generative modeling tasks.
arXiv Detail & Related papers (2025-05-24T04:16:40Z) - DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models [81.84866217721361]
DiffusionBERT is a new generative masked language model based on discrete diffusion models.
We propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step.
Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text.
arXiv Detail & Related papers (2022-11-28T03:25:49Z) - Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z) - Masking as an Efficient Alternative to Finetuning for Pretrained Language Models [49.64561153284428]
We learn selective binary masks for pretrained weights in lieu of modifying them through finetuning.
In intrinsic evaluations, we show that representations computed by masked language models encode information necessary for solving downstream tasks.
arXiv Detail & Related papers (2020-04-26T15:03:47Z) - UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z) - Semi-Autoregressive Training Improves Mask-Predict Decoding [119.8412758943192]
We introduce a new training method for conditional masked language models, SMART, which mimics the semi-autoregressive behavior of mask-predict.
Models trained with SMART produce higher-quality translations when using mask-predict decoding, effectively closing the remaining performance gap with fully autoregressive models.
arXiv Detail & Related papers (2020-01-23T19:56:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.