FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion
- URL: http://arxiv.org/abs/2602.05305v2
- Date: Fri, 06 Feb 2026 17:20:17 GMT
- Title: FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion
- Authors: Zhuokun Chen, Jianfei Cai, Bohan Zhuang
- Abstract summary: FlashBlock is a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Experiments on diffusion language models and video generation demonstrate up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality.
- Score: 51.1618564189244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.
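The core mechanism in the abstract, reusing the stable block-external attention output across denoising steps while recomputing only block-internal attention, can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names, the `refresh` flag, and the log-sum-exp merging of attention partials are illustrative choices (the merge itself is the standard exact way to combine softmax attention over disjoint key sets).

```python
import torch

def attn_partial(q, k, v):
    # Softmax attention over one key segment, returning both the
    # normalized output and the log-sum-exp of the logits so that
    # disjoint segments can later be merged exactly.
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale       # [..., Tq, Tk]
    lse = torch.logsumexp(logits, dim=-1, keepdim=True)
    return torch.exp(logits - lse) @ v, lse          # [..., Tq, d], [..., Tq, 1]

def merge(out_a, lse_a, out_b, lse_b):
    # Exact flash-attention-style combination of two attention
    # partials computed over disjoint key sets.
    lse = torch.logaddexp(lse_a, lse_b)
    return torch.exp(lse_a - lse) * out_a + torch.exp(lse_b - lse) * out_b

class CachedExternalAttention:
    """Sketch of the caching idea: the expensive attention over all
    block-external keys is computed once per block and reused across
    denoising steps; only block-internal attention is recomputed."""
    def __init__(self):
        self.cache = None  # (out_ext, lse_ext) for the current block

    def step(self, q, k_ext, v_ext, k_blk, v_blk, refresh=False):
        if self.cache is None or refresh:
            # Full pass over the growing external KV cache (expensive).
            self.cache = attn_partial(q, k_ext, v_ext)
        out_blk, lse_blk = attn_partial(q, k_blk, v_blk)  # cheap per step
        return merge(*self.cache, out_blk, lse_blk)
```

Because the block's tokens are re-denoised and its queries therefore drift slightly between steps, reusing the cached external partial is an approximation; the abstract's observation that block-external attention outputs stay largely stable across steps is what keeps the error negligible. The same cached partial could also serve as the dense residual when the per-step computation is sparsified, consistent with the abstract's claim that FlashBlock composes with sparse attention.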
Related papers
- MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM [9.69241599043101]
Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements. MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training.
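The summary's key mechanism, using attention from the first all-[MASK] denoising step to pick which KV entries to keep, admits a brief sketch. Everything concrete below (the `keep_ratio` parameter, summing attention over heads and queries) is a hypothetical reading of the one-sentence summary, not MAGE's actual selection rule.

```python
import torch

def select_kv_to_keep(attn_first_step, keep_ratio=0.25):
    # attn_first_step: [heads, block_len, cache_len] attention weights
    # from the first all-[MASK] denoising step of the current block.
    # Importance of a cached entry = total attention it receives from
    # the all-[MASK] queries, summed over heads (an assumption here).
    importance = attn_first_step.sum(dim=(0, 1))             # [cache_len]
    k = max(1, int(keep_ratio * importance.numel()))
    return torch.topk(importance, k).indices.sort().values   # KV indices to retain
```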
arXiv Detail & Related papers (2026-02-15T16:07:51Z) - Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention [37.91838955436801]
Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. As generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. We propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor matching; and AnnSA sparsifies self-attention by restricting each query
arXiv Detail & Related papers (2026-02-02T08:31:21Z) - Causal Autoregressive Diffusion Language Model [70.7353007255797]
CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation.
arXiv Detail & Related papers (2026-01-29T17:38:29Z) - VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding [52.69880888587866]
Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. We propose VidLaDA, a Diffusion Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive modeling and decode tokens in parallel. Experiments show VidLaDA rivals state-of-the-art AR baselines and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy.
arXiv Detail & Related papers (2026-01-25T15:02:01Z) - From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs [58.640039233470766]
We show that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. NBDiff-7B (Base and Instruct) inherits long-context modeling and reasoning capabilities and achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-12-07T10:28:21Z) - BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation [44.45173635133032]
BlockVid is a novel block diffusion framework equipped with a semantic-aware sparse KV cache. LV-Bench is a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence.
arXiv Detail & Related papers (2025-11-28T08:25:59Z) - BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching [6.354675628412448]
Block-Wise Caching (BWCache) is a training-free method to accelerate DiT-based video generation. Experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24$\times$ speedup with comparable visual quality.
arXiv Detail & Related papers (2025-09-17T07:58:36Z) - Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models [15.853201399662344]
Diffusion language models offer unique benefits over autoregressive models. However, they lag in likelihood modeling and are limited to fixed-length generation. We introduce a class of block diffusion language models that interpolate between discrete denoising diffusion and autoregressive models.
arXiv Detail & Related papers (2025-03-12T17:43:40Z) - Generalized Interpolating Discrete Diffusion [65.74168524007484]
Masked diffusion is a popular choice due to its simplicity and effectiveness. We introduce a new family of general interpolating discrete diffusion (GIDD) processes, which offers greater flexibility in the design of the noising process. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality.
arXiv Detail & Related papers (2025-03-06T14:30:55Z) - ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
ACDiT is a blockwise Conditional Diffusion Transformer. It offers a flexible interpolation between token-wise autoregression and full-sequence diffusion. We show that ACDiT performs best among all autoregressive baselines on image and video generation tasks.
arXiv Detail & Related papers (2024-12-10T18:13:20Z) - Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.