MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM
- URL: http://arxiv.org/abs/2602.14209v1
- Date: Sun, 15 Feb 2026 16:07:51 GMT
- Title: MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM
- Authors: Omin Kwon, Yeonjae Kim, Doyeon Kim, Minseo Kim, Yeonhong Park, Jae W. Lee
- Abstract summary: Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements. MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training on a single NVIDIA H100 GPU.
- Score: 9.69241599043101
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively explored, existing methods designed for autoregressive LLMs rely on approximate importance estimation and perform poorly when adapted to block diffusion. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising. Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup, consistently outperforming AR-oriented sparse attention baselines. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training on a single NVIDIA H100 GPU for both 1.5B and 7B models.
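The core mechanism the abstract describes (a single exact attention pass at the all-[MASK] step whose scores identify which KV entries matter, reused for the rest of the block's denoising) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function name, the coverage-based budget heuristic, and the averaging over mask queries are all assumptions.

```python
# Illustrative sketch of [MASK]-guided KV selection in the spirit of MAGE.
# Assumptions (not from the paper): the budget is chosen so the kept entries
# cover a fixed fraction of attention mass, and importance is averaged over
# the all-[MASK] queries of the block.
import numpy as np

def mask_step_kv_selection(q_mask, keys, coverage=0.9):
    """One exact attention pass with the all-[MASK] queries; keep the
    smallest set of KV entries whose attention mass reaches `coverage`."""
    d = q_mask.shape[-1]
    scores = q_mask @ keys.T / np.sqrt(d)            # (n_mask, n_kv)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)            # row-wise softmax
    importance = probs.mean(0)                       # pool over mask queries
    order = np.argsort(importance)[::-1]             # most important first
    cum = np.cumsum(importance[order])
    budget = int(np.searchsorted(cum, coverage)) + 1 # adaptive KV budget
    return np.sort(order[:budget])                   # indices to keep

# Later denoising steps of the same block would attend only to `keep`.
rng = np.random.default_rng(0)
keep = mask_step_kv_selection(rng.normal(size=(4, 16)),
                              rng.normal(size=(128, 16)))
```

The point of the sketch is the reuse: the selection is computed once per block from exact attention, so subsequent denoising steps pay only for the retained entries.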
Related papers
- FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion [51.1618564189244]
FlashBlock is a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Experiments on diffusion language models and video generation demonstrate up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality.
arXiv Detail & Related papers (2026-02-05T04:57:21Z) - Causal Autoregressive Diffusion Language Model [70.7353007255797]
CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation.
arXiv Detail & Related papers (2026-01-29T17:38:29Z) - Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models. Our approach integrates seamlessly into both the prefill and decoding stages. Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
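The eviction step that this entry describes (scoring cached entries with a gate and dropping the lowest-scored fraction) can be sketched as below. The gate scores in Fast KVzip are learned; here they are simply an input, and the function name and interface are hypothetical.

```python
# Minimal sketch of score-based KV cache eviction. The gating itself is
# learned in the cited work; this only illustrates the eviction mechanics.
import numpy as np

def evict_kv(kv_cache, gate_scores, evict_ratio=0.7):
    """Drop the `evict_ratio` fraction of cache entries with the lowest
    gate score, preserving the original sequence order of the survivors."""
    n = len(gate_scores)
    n_keep = max(1, n - int(n * evict_ratio))
    keep = np.argsort(gate_scores)[::-1][:n_keep]  # highest-scored entries
    keep = np.sort(keep)                           # restore sequence order
    return kv_cache[keep], keep
```

Keeping the survivors in sequence order matters because positional information is tied to the original token indices.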
arXiv Detail & Related papers (2026-01-25T03:07:54Z) - SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding [25.2227348401136]
Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding. We show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion.
arXiv Detail & Related papers (2025-12-16T04:12:52Z) - Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs [17.499497967319332]
We introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our experiments on Gemma2 with the Needle-in-a-Haystack test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%.
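The block-level sparsity prediction this entry refers to can be approximated by pooling keys within fixed-size blocks and keeping only the top-scoring blocks per query. This is a rough sketch under assumed details: the real DHSA predictor is data-driven, whereas this uses plain mean pooling, and all names here are illustrative.

```python
# Rough sketch of dynamic block-sparse attention selection. DHSA learns its
# sparsity predictor; this substitutes simple mean pooling of key blocks.
import numpy as np

def block_sparse_mask(q, keys, block=16, keep_blocks=4):
    """Return a boolean mask over KV positions: True for positions inside
    the `keep_blocks` key blocks most relevant to query vector `q`."""
    n_kv, d = keys.shape
    n_blocks = (n_kv + block - 1) // block
    pooled = np.stack([keys[i * block:(i + 1) * block].mean(0)
                       for i in range(n_blocks)])      # (n_blocks, d)
    score = (q @ pooled.T) / np.sqrt(d)                # per-block relevance
    top = np.argsort(score)[::-1][:keep_blocks]
    mask = np.zeros(n_kv, dtype=bool)
    for b in top:
        mask[b * block:(b + 1) * block] = True
    return mask
```

Attention is then computed only over the `True` positions, which is what yields the prefill latency and memory reductions the entry reports.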
arXiv Detail & Related papers (2025-10-28T16:34:18Z) - Fast-dLLM v2: Efficient Block-Diffusion LLM [64.38006546510337]
Fast-dLLM v2 is a block diffusion language model that adapts pretrained AR models into dLLMs for parallel text generation. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens).
arXiv Detail & Related papers (2025-09-30T14:40:18Z) - SparseD: Sparse Attention for Diffusion Language Models [98.05780626106555]
Diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), but existing open-source DLMs suffer from high inference latency. We propose SparseD, a novel sparse attention method for DLMs.
arXiv Detail & Related papers (2025-09-28T18:10:10Z) - SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference [71.20542521694524]
SmallKV is a small model assisted compensation method for KV cache compression. We show that SmallKV achieves 1.75-2.56× higher throughput than baseline methods.
arXiv Detail & Related papers (2025-08-03T09:15:36Z) - Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM [7.651654889371008]
Transformer-based models are the foundation of modern machine learning, but their execution places significant pressure on memory systems. Processing-in-memory (PIM) architectures are a promising solution, offering high internal bandwidth and compute parallelism near memory. Current PIM designs are primarily optimized for dense attention and struggle with the dynamic, irregular access patterns introduced by modern KV cache sparsity techniques.
arXiv Detail & Related papers (2025-05-09T04:17:05Z) - SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs [10.702409298302547]
SeerAttention learns the block-level attention sparsity from the Large Language Model itself. Inspired by the gating mechanism in Mixture of Experts (MoE), SeerAttention augments the conventional attention with a learnable gate. Our evaluation results demonstrate that SeerAttention achieves better model accuracy and lower latency for long-context pre-filling.
arXiv Detail & Related papers (2024-10-17T07:07:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.