Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models
- URL: http://arxiv.org/abs/2511.21338v1
- Date: Wed, 26 Nov 2025 12:44:29 GMT
- Title: Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models
- Authors: Julianna Piskorz, Cristina Pinneri, Alvaro Correia, Motasem Alfarra, Risheek Garrepalli, Christos Louizos,
- Abstract summary: Masked Diffusion Language Models have emerged as a promising alternative to Autoregressive Language Models. We show that MDLMs exhibit a strong locality bias, favouring local over distant context. We introduce a mask-agnostic loss function that encourages predictions to remain invariant to the number of appended masks.
- Score: 19.847438086389616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Diffusion Language Models (MDLMs) have recently emerged as a promising alternative to Autoregressive Language Models (ARLMs), leveraging a denoising objective that, in principle, should enable more uniform context utilisation. In this work, we examine the context comprehension abilities of MDLMs and uncover two key limitations. First, despite their more global training objective and bidirectional attention mechanism, similarly to ARLMs, MDLMs exhibit a strong locality bias: performance is highly sensitive to the position of relevant information within the input, favouring local over distant context. Second, we show that appending a large number of mask tokens--required for generation--can significantly degrade context comprehension. Through systematic ablations, we find that these masks act as distractors, reducing the model's ability to process relevant information. To address this, we introduce a mask-agnostic loss function that encourages predictions to remain invariant to the number of appended masks. Fine-tuning with this objective substantially mitigates the distracting effect of masks, improving the robustness of MDLMs. Overall, our findings reveal critical limitations of the current MDLM training paradigm and provide actionable insights for building diffusion-based language models with stronger context comprehension.
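The abstract describes a mask-agnostic loss that keeps predictions invariant to the number of appended mask tokens, but does not spell out its form here. The following PyTorch sketch shows one plausible instantiation under stated assumptions: it appends two different numbers of masks to the same context and penalises the symmetric KL divergence between the resulting predictive distributions at the positions of interest. The function name, the symmetric-KL choice, the Hugging-Face-style `model(...).logits` interface, and arguments such as `num_masks_a`, `num_masks_b`, and `answer_positions` are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def mask_agnostic_consistency_loss(model, input_ids, answer_positions,
                                   mask_token_id, num_masks_a=16, num_masks_b=256):
    """Hypothetical sketch of a mask-agnostic objective: predictions at the
    answer positions should not change when more masks are appended."""
    def with_appended_masks(ids, n):
        masks = torch.full((ids.size(0), n), mask_token_id,
                           dtype=ids.dtype, device=ids.device)
        return torch.cat([ids, masks], dim=1)

    logits_a = model(with_appended_masks(input_ids, num_masks_a)).logits
    logits_b = model(with_appended_masks(input_ids, num_masks_b)).logits

    # Compare only the positions of interest; the appended masks themselves
    # are not scored, since their count differs between the two passes.
    log_p_a = F.log_softmax(logits_a[:, answer_positions, :], dim=-1)
    log_p_b = F.log_softmax(logits_b[:, answer_positions, :], dim=-1)

    # Symmetric KL encourages invariance to the number of appended masks.
    kl_ab = F.kl_div(log_p_a, log_p_b, reduction="batchmean", log_target=True)
    kl_ba = F.kl_div(log_p_b, log_p_a, reduction="batchmean", log_target=True)
    return 0.5 * (kl_ab + kl_ba)
```

In practice such a consistency term would presumably be added to the standard MDLM denoising loss with a weighting coefficient; the paper should be consulted for the actual formulation.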
Related papers
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs [100.02824137397464]
We investigate how Large Language Models adapt their internal representations when encountering inputs of increasing difficulty. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, the last hidden states of LLMs become substantially sparser. This sparsity--difficulty relation is observable across diverse models and domains.
arXiv Detail & Related papers (2026-03-03T18:48:15Z) - Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation [51.743225614196774]
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning. They remain vulnerable to hallucination, where generated content deviates from visual evidence. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding. We propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs.
arXiv Detail & Related papers (2026-02-27T14:18:51Z) - Relaxing Positional Alignment in Masked Diffusion Language Models [6.511565218210195]
Masked diffusion language models (MDLMs) have emerged as a promising alternative to dominant autoregressive approaches. We show that strict positional prediction makes MDLM decoding highly sensitive to token misalignment. We apply this approach to the widely used MDLM model and conduct experiments on five open-ended text generation benchmarks.
arXiv Detail & Related papers (2026-01-30T13:09:21Z) - Soft-Masked Diffusion Language Models [35.191030145577145]
We introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-$k$ predicted tokens. We demonstrate that continued pretraining of a 169M parameter model with SM leads to improved perplexity and MAUVE scores. We finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. (A minimal sketch of this blending idea appears after the list below.)
arXiv Detail & Related papers (2025-10-20T06:42:03Z) - Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z) - Exploring Gradient-Guided Masked Language Model to Detect Textual Adversarial Attacks [50.53590930588431]
Adversarial examples pose serious threats to natural language processing systems. Recent studies suggest that adversarial texts deviate from the underlying manifold of normal texts, whereas masked language models can approximate the manifold of normal data. We first introduce Masked Language Model-based Detection (MLMD), leveraging the mask-and-unmask operations of the masked language modeling (MLM) objective.
arXiv Detail & Related papers (2025-04-08T14:10:57Z) - Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More [26.226145789963443]
Mask-Enhanced Autoregressive Prediction (MEAP) is a training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP). In extensive experiments, MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens.
arXiv Detail & Related papers (2025-02-11T11:49:03Z) - ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models [11.997499811414837]
Masked Language Models (MLMs) are trained by randomly masking portions of the input sequences with [MASK] tokens and learning to reconstruct the original content based on the remaining context.
arXiv Detail & Related papers (2025-01-23T05:46:50Z) - Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z) - UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)
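As referenced in the Soft-Masked Diffusion Language Models entry above, soft-masking blends the mask-token embedding with the embeddings of the top-$k$ predicted tokens. The PyTorch sketch below is a loose illustration of that idea under stated assumptions, not the authors' implementation: the fixed blending weight `alpha`, the probability-weighted mixture over the top-$k$ embeddings, and the function name are all assumptions (the paper may, for instance, set the blend dynamically per position).

```python
import torch
import torch.nn.functional as F

def soft_mask_embedding(logits, token_embedding, mask_embedding, k=8, alpha=0.5):
    """Hypothetical soft-masking sketch: blend the [MASK] embedding with a
    probability-weighted average of the top-k predicted token embeddings."""
    # logits: (batch, seq, vocab) predictions for currently masked positions
    topk_logits, topk_ids = logits.topk(k, dim=-1)            # (B, S, k)
    weights = F.softmax(topk_logits, dim=-1).unsqueeze(-1)    # (B, S, k, 1)
    topk_emb = token_embedding(topk_ids)                      # (B, S, k, d)
    predicted_emb = (weights * topk_emb).sum(dim=-2)          # (B, S, d)
    # Convex combination of the hard mask embedding and the predicted mixture.
    return alpha * mask_embedding + (1.0 - alpha) * predicted_emb
```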