Attention Sinks in Diffusion Language Models
- URL: http://arxiv.org/abs/2510.15731v1
- Date: Fri, 17 Oct 2025 15:23:58 GMT
- Title: Attention Sinks in Diffusion Language Models
- Authors: Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, Alessio Devoto
- Abstract summary: Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). We conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour.
- Score: 15.450369268824835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). DLMs employ transformer encoders with bidirectional attention, enabling parallel token generation while maintaining competitive performance. Although their efficiency and effectiveness have been extensively studied, the internal mechanisms that govern DLMs remain largely unexplored. In this work, we conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour. Second, while ARMs are highly sensitive to the removal of attention sinks, DLMs remain robust: masking sinks leads to only a minor degradation in performance. These results provide new insights into the inner workings of diffusion-based language models and highlight fundamental differences in how they allocate and utilize attention compared to autoregressive models.
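The sink analysis described in the abstract can be illustrated with a minimal sketch. The snippet below shows one plausible way to flag attention-sink positions from a single layer's attention maps; the shape convention `(heads, seq, seq)` and the threshold of 3x the uniform baseline are illustrative assumptions, not the authors' actual procedure.

```python
import numpy as np

def find_attention_sinks(attn, threshold=3.0):
    """Flag attention-sink positions in one layer's attention maps.

    attn: array of shape (heads, seq, seq); rows are query positions,
    columns are key positions, and each row sums to 1.
    A position counts as a sink when the average attention mass it
    receives exceeds `threshold` times the uniform baseline 1/seq.
    """
    heads, seq, _ = attn.shape
    # Mean incoming attention per key position, averaged over heads and queries.
    incoming = attn.mean(axis=(0, 1))          # shape (seq,)
    baseline = 1.0 / seq
    return np.flatnonzero(incoming > threshold * baseline)

# Synthetic example: every query sends half its attention to position 0
# and spreads the rest uniformly over the other positions.
seq, heads = 8, 4
attn = np.full((heads, seq, seq), 0.5 / (seq - 1))
attn[:, :, 0] = 0.5
sinks = find_attention_sinks(attn)
print(sinks)  # -> [0]
```

Rerunning this detector at each denoising step would expose the dynamic, shifting sink positions the paper reports for DLMs, in contrast to the fixed early-token sinks typical of ARMs.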
Related papers
- Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection [52.5174167737992]
Video anomaly detection (VAD) aims to identify abnormal events in videos. We propose SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our method achieves state-of-the-art performance among tuning-free approaches, requiring only 1% of training data.
arXiv Detail & Related papers (2026-02-27T13:48:50Z) - DLLM Agent: See Farther, Run Faster [94.74432470237817]
Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties. We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow. We find that DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup.
arXiv Detail & Related papers (2026-02-07T09:01:18Z) - DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders [73.18745837755758]
We present DLM-Scope, the first SAE-based interpretability framework for diffusion language models. We show that trained Top-K SAEs can faithfully extract interpretable features. We also show great potential for applying SAEs to DLM-related tasks and algorithms.
arXiv Detail & Related papers (2026-02-05T16:41:25Z) - Relaxing Positional Alignment in Masked Diffusion Language Models [6.511565218210195]
Masked diffusion language models (MDLMs) have emerged as a promising alternative to dominant autoregressive approaches. We show that strict positional prediction makes MDLM decoding highly sensitive to token misalignment. We apply this approach to the widely used MDLM model and conduct experiments on five open-ended text generation benchmarks.
arXiv Detail & Related papers (2026-01-30T13:09:21Z) - One Token Is Enough: Improving Diffusion Language Models with a Sink Token [9.076240488230274]
Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches. There is a critical instability in DLMs: the moving sink phenomenon. We propose a simple but effective extra sink token implemented via a modified attention mask.
arXiv Detail & Related papers (2026-01-27T14:32:36Z) - Revealing the Attention Floating Mechanism in Masked Diffusion Models [52.74142815156738]
Masked diffusion models (MDMs) leverage bidirectional attention and a denoising process. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating.
arXiv Detail & Related papers (2026-01-12T09:10:05Z) - Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models [82.87985794856803]
Large Language Models (LLMs) have achieved state-of-the-art performance on a broad range of Natural Language Processing (NLP) tasks. Recently, Diffusion Language Models (DLMs) have emerged as a promising alternative architecture.
arXiv Detail & Related papers (2025-10-05T10:50:52Z) - A Survey on Diffusion Language Models [30.00199970146068]
Diffusion Language Models (DLMs) are an alternative to the dominant autoregressive (AR) paradigm. DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context. Recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts.
arXiv Detail & Related papers (2025-08-14T17:47:22Z) - Don't Take Things Out of Context: Attention Intervention for Enhancing Chain-of-Thought Reasoning in Large Language Models [32.71672086718058]
Few-shot Chain-of-Thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs). We observe that isolated segments, words, or tokens within CoT demonstrations can unexpectedly disrupt the generation process of LLMs. We propose a Few-shot Attention Intervention method (FAI) that dynamically analyzes the attention patterns of demonstrations to accurately identify these tokens.
arXiv Detail & Related papers (2025-03-14T07:46:33Z) - Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
arXiv Detail & Related papers (2024-10-17T17:54:06Z) - Probing Large Language Models from A Human Behavioral Perspective [24.109080140701188]
Large Language Models (LLMs) have emerged as dominant foundational models in modern NLP.
The understanding of their prediction processes and internal mechanisms, such as feed-forward networks (FFN) and multi-head self-attention (MHSA) remains largely unexplored.
arXiv Detail & Related papers (2023-10-08T16:16:21Z) - An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information. This study empirically evaluates the forgetting phenomenon in large language models during continual instruction tuning.
arXiv Detail & Related papers (2023-08-17T02:53:23Z) - David helps Goliath: Inference-Time Collaboration Between Small Specialized and Large General Diffusion LMs [49.822063966687175]
Diffusion-based language models are emerging as a promising alternative to autoregressive LMs.
We propose methods to scale a recently proposed diffusion model SSD-LM from 0.4B to 13B parameters.
We show that SSD-2 facilitates novel ensembles with 100x smaller models that can be customized and deployed by individual users.
arXiv Detail & Related papers (2023-05-24T06:22:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.