One Token Is Enough: Improving Diffusion Language Models with a Sink Token
- URL: http://arxiv.org/abs/2601.19657v2
- Date: Thu, 29 Jan 2026 15:35:27 GMT
- Title: One Token Is Enough: Improving Diffusion Language Models with a Sink Token
- Authors: Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Shaosheng Cao,
- Abstract summary: Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches. There is a critical instability in DLMs: the moving sink phenomenon. We propose a simple but effective extra sink token implemented via a modified attention mask.
- Score: 9.076240488230274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance. Despite these advantages, there is a critical instability in DLMs: the moving sink phenomenon. Our analysis indicates that sink tokens exhibit low-norm representations in the Transformer's value space, and that the moving sink phenomenon serves as a protective mechanism in DLMs to prevent excessive information mixing. However, their unpredictable positions across diffusion steps undermine inference robustness. To resolve this, we propose a simple but effective extra sink token implemented via a modified attention mask. Specifically, we introduce a special token constrained to attend solely to itself, while remaining globally visible to all other tokens. Experimental results demonstrate that introducing a single extra token stabilizes attention sinks, substantially improving model performance. Crucially, further analysis confirms that the effectiveness of this token is independent of its position and characterized by negligible semantic content, validating its role as a robust and dedicated structural sink.
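The abstract describes the mechanism concretely: a special token that attends only to itself while remaining visible to every other token. As a rough illustration (not code from the paper), the modified attention mask could be built as follows, assuming a boolean mask convention where `True` means "may attend" and a bidirectional DLM-style attention pattern; the `sink_pos` parameter and function name are hypothetical:

```python
import numpy as np

def build_sink_mask(seq_len: int, sink_pos: int = 0) -> np.ndarray:
    """Boolean attention mask (True = query may attend to key) with one
    dedicated structural sink: the sink token attends only to itself,
    while all other tokens can still attend to the sink."""
    # Start from fully bidirectional attention, as in masked diffusion LMs.
    mask = np.ones((seq_len, seq_len), dtype=bool)
    mask[sink_pos, :] = False        # sink attends to nothing ...
    mask[sink_pos, sink_pos] = True  # ... except itself
    # Column sink_pos is left True everywhere: the sink stays globally visible.
    return mask

mask = build_sink_mask(5, sink_pos=0)
```

In this sketch the sink row is isolated (one `True` entry) while its column remains fully visible, matching the abstract's "attend solely to itself, while remaining globally visible" description.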
Related papers
- Same Answer, Different Representations: Hidden instability in VLMs [65.36933543377346]
We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness. We apply this framework to modern Vision Language Models (VLMs) across the SEEDBench, MMMU, and POPE datasets.
arXiv Detail & Related papers (2026-02-06T12:24:26Z) - Reasoning with Latent Tokens in Diffusion Language Models [47.27454676014286]
We show that diffusion models are trained to jointly predict a distribution over all unknown tokens, including those that will not actually be decoded in the current step. We demonstrate that latent tokens can be introduced into autoregressive models through an auxiliary multi-token prediction objective. Our results suggest that latent tokens, while arising naturally in diffusion, represent a general mechanism for improving performance on tasks requiring global coherence or lookahead.
arXiv Detail & Related papers (2026-02-03T17:27:46Z) - Relaxing Positional Alignment in Masked Diffusion Language Models [6.511565218210195]
Masked diffusion language models (MDLMs) have emerged as a promising alternative to dominant autoregressive approaches. We show that strict positional prediction makes MDLM decoding highly sensitive to token misalignment. We apply this approach to the widely used MDLM model and conduct experiments on five open-ended text generation benchmarks.
arXiv Detail & Related papers (2026-01-30T13:09:21Z) - Causal Autoregressive Diffusion Language Model [70.7353007255797]
CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation.
arXiv Detail & Related papers (2026-01-29T17:38:29Z) - D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning [49.16227597771663]
D2Pruner is a framework that combines debiased importance with a structural pruning mechanism. It reduces FLOPs by 74.2% while retaining 99.2% of its original performance. It marks a significant advancement with up to 63.53% improvement over existing methods.
arXiv Detail & Related papers (2025-12-22T14:42:31Z) - Learning Unmasking Policies for Diffusion Language Models [33.44995119635116]
Diffusion Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model's vocabulary. In this work, we propose to train sampling procedures using reinforcement learning.
arXiv Detail & Related papers (2025-12-09T20:44:33Z) - Attention Sinks in Diffusion Language Models [15.450369268824835]
Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). We conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour.
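The shifting sink positions this paper reports could be tracked with a simple proxy: at each diffusion step, find the key position receiving the most total attention mass. This is a toy illustration, not the paper's own analysis method; the function name and the column-sum criterion are assumptions:

```python
import numpy as np

def sink_position(attn: np.ndarray) -> int:
    """Given one head's attention matrix attn[query, key] (each row a
    distribution over keys), return the key position receiving the most
    total attention: a crude proxy for the sink location at this step."""
    return int(attn.sum(axis=0).argmax())

# Toy example: most queries concentrate their attention on key position 2.
attn = np.full((4, 4), 0.1)
attn[:, 2] = 0.7
print(sink_position(attn))  # 2
```

Logging this value across diffusion steps would expose the "moving sink" behaviour: in ARMs the position stays pinned (typically at the first token), whereas in DLMs it drifts.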
arXiv Detail & Related papers (2025-10-17T15:23:58Z) - Continuously Augmented Discrete Diffusion model for Categorical Generative Modeling [87.34677262370924]
Standard discrete diffusion models treat all unobserved states identically by mapping them to an absorbing [MASK] token. This creates an 'information void' where semantic information that could be inferred from unmasked tokens is lost between denoising steps. We introduce Continuously Augmented Discrete Diffusion, a framework that augments the discrete state space with a paired diffusion in a continuous latent space.
arXiv Detail & Related papers (2025-10-01T18:00:56Z) - An Ensemble Framework for Unbiased Language Model Watermarking [60.99969104552168]
We propose ENS, a novel ensemble framework that enhances the detectability and robustness of unbiased watermarks. ENS sequentially composes multiple independent watermark instances, each governed by a distinct key, to amplify the watermark signal. Empirical evaluations show that ENS substantially reduces the number of tokens needed for reliable detection and increases resistance to smoothing and paraphrasing attacks.
arXiv Detail & Related papers (2025-09-28T19:37:44Z) - StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs [54.229363096087866]
Speech tokenizers are not robust to meaning-irrelevant acoustic perturbations. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal. We introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism.
arXiv Detail & Related papers (2025-09-26T11:32:51Z) - Don't Take Things Out of Context: Attention Intervention for Enhancing Chain-of-Thought Reasoning in Large Language Models [32.71672086718058]
Few-shot Chain-of-Thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs). We observe that isolated segments, words, or tokens within CoT demonstrations can unexpectedly disrupt the generation process of LLMs. We propose a Few-shot Attention Intervention method (FAI) that dynamically analyzes the attention patterns of demonstrations to accurately identify these tokens.
arXiv Detail & Related papers (2025-03-14T07:46:33Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Attention Sinks: A 'Catch, Tag, Release' Mechanism for Embeddings [16.950215926321558]
Large language models (LLMs) often concentrate their attention on a few specific tokens referred to as attention sinks. Common examples include the first token, a prompt-independent sink, and punctuation tokens. Despite their ubiquity, the function, semantic role, and origin of attention sinks remain poorly understood.
arXiv Detail & Related papers (2025-02-02T21:15:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.