Revealing the Attention Floating Mechanism in Masked Diffusion Models
- URL: http://arxiv.org/abs/2601.07894v1
- Date: Mon, 12 Jan 2026 09:10:05 GMT
- Title: Revealing the Attention Floating Mechanism in Masked Diffusion Models
- Authors: Xin Dai, Pengcheng Huang, Zhenghao Liu, Shuo Wang, Yukun Yan, Chaojun Xiao, Yu Gu, Ge Yu, Maosong Sun
- Abstract summary: Masked diffusion models (MDMs) leverage bidirectional attention and a denoising process. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating.
- Score: 52.74142815156738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under-explored. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals a Shallow Structure-Aware, Deep Content-Focused attention mechanism: shallow layers utilize floating tokens to build a global structural framework, while deeper layers allocate more capacity toward capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in-context learning capabilities of MDMs, allowing them to double the performance of ARMs on knowledge-intensive tasks. All code and datasets are available at https://github.com/NEUIR/Attention-Floating.
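As a rough illustration of the kind of measurement the abstract describes, the sketch below tracks where attention mass concentrates at each denoising step and layer. This is a minimal sketch, not the authors' released code: the `attn_maps` layout (a list over denoising steps of per-layer attention tensors of shape `(num_heads, seq_len, seq_len)`) and the helper names `attention_anchors` and `anchor_drift` are assumptions made for illustration.

```python
# Illustrative sketch (not the paper's implementation): locate the "anchor"
# key positions that receive the most attention mass at each denoising step
# and layer, then measure how often those anchors shift between steps.
# Assumption: attn_maps[step][layer] is a tensor of shape
# (num_heads, seq_len, seq_len) with softmax-normalized rows.
import torch

def attention_anchors(attn_maps, top_k=3):
    """Top-k key positions by received attention, per (step, layer)."""
    anchors = {}
    for step, layers in enumerate(attn_maps):
        for layer, attn in enumerate(layers):
            # Attention mass each key position receives, averaged over
            # heads first and then over query positions.
            received = attn.mean(dim=0).mean(dim=0)  # shape: (seq_len,)
            anchors[(step, layer)] = torch.topk(received, top_k).indices.tolist()
    return anchors

def anchor_drift(anchors, layer):
    """Fraction of consecutive denoising steps at which a layer's anchor
    set changes, i.e. a crude score of how much its anchors 'float'."""
    steps = sorted(s for (s, l) in anchors if l == layer)
    changes = sum(
        set(anchors[(a, layer)]) != set(anchors[(b, layer)])
        for a, b in zip(steps, steps[1:])
    )
    return changes / max(len(steps) - 1, 1)
```

By the abstract's account, such a drift score should stay high for MDMs, whose anchors float across denoising steps, and near zero for ARMs, whose attention converges to a fixed sink.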
Related papers
- Revealing and Enhancing Core Visual Regions: Harnessing Internal Attention Dynamics for Hallucination Mitigation in LVLMs [67.69730908817321]
Internal Positive Attention Dynamics (PAD) in LVLMs naturally reveal semantically core visual regions under the distortions of attention sinks. We propose Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a PAD map to identify semantically core visual regions.
arXiv Detail & Related papers (2026-02-17T13:08:06Z)
- Robust Representation Learning in Masked Autoencoders [2.599882743586164]
Masked Autoencoders (MAEs) achieve impressive performance in image classification tasks, yet the internal representations they learn remain less understood. This work started as an attempt to understand the strong downstream classification performance of MAE.
arXiv Detail & Related papers (2026-02-03T13:48:34Z)
- Attention Sinks in Diffusion Language Models [15.450369268824835]
Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). We conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour.
arXiv Detail & Related papers (2025-10-17T15:23:58Z)
- MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding [24.731387422897644]
Multimodal large language models (MLLMs) have recently shown strong capacity for integrating data across multiple modalities. Modular Duplex Attention (MODA) simultaneously conducts inner-modal refinement and inter-modal interaction. Experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks.
arXiv Detail & Related papers (2025-07-07T03:37:42Z)
- Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
arXiv Detail & Related papers (2024-10-17T17:54:06Z)
- Towards Robust Semantic Segmentation against Patch-based Attack via Attention Refinement [68.31147013783387]
We observe that the attention mechanism is vulnerable to patch-based adversarial attacks.
In this paper, we propose a Robust Attention Mechanism (RAM) to improve the robustness of the semantic segmentation model.
arXiv Detail & Related papers (2024-01-03T13:58:35Z)
- Diffusion Models Without Attention [110.5623058129782]
Diffusion State Space Model (DiffuSSM) is an architecture that supplants attention mechanisms with a more scalable state space model backbone.
Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward.
arXiv Detail & Related papers (2023-11-30T05:15:35Z)
- Calibrating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation [51.14107156747967]
Weakly supervised semantic segmentation (WSSS) has attracted considerable attention because it requires fewer annotations than fully supervised approaches. We propose an Adaptive Re-Activation Mechanism (AReAM) that calibrates deep-level attention to counter undisciplined over-smoothing. AReAM substantially improves segmentation performance compared with existing WSSS methods, reducing noise while sharpening focus on relevant semantic regions.
arXiv Detail & Related papers (2023-05-04T19:11:33Z)
- Dynamic Scene Deblurring Base on Continuous Cross-Layer Attention Transmission [6.3482616879743885]
We introduce a new continuous cross-layer attention transmission (CCLAT) mechanism that can exploit hierarchical attention information from all the convolutional layers.
Taking RDAFB as the building block, we design an effective architecture for dynamic scene deblurring named RDAFNet.
Experiments on benchmark datasets show that the proposed model outperforms the state-of-the-art deblurring approaches.
arXiv Detail & Related papers (2022-06-23T04:55:13Z)