FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers
- URL: http://arxiv.org/abs/2509.25401v1
- Date: Mon, 29 Sep 2025 18:57:14 GMT
- Title: FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers
- Authors: Liang Qiao, Yue Dai, Yeqi Huang, Hongyu Kan, Jun Shi, Hong An,
- Abstract summary: FlashOmni is a unified sparse attention engine compatible with arbitrary DiT architectures. It delivers a near-linear speedup, closely matching the sparsity ratio, in attention and GEMM-$Q$, and achieves 2.5$\times$-3.8$\times$ acceleration in GEMM-$O$.
- Score: 7.026182341295719
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-Modal Diffusion Transformers (DiTs) demonstrate exceptional capabilities in visual synthesis, yet their deployment remains constrained by substantial computational demands. To alleviate this bottleneck, many sparsity-based acceleration methods have been proposed. However, their diverse sparsity patterns often require customized kernels for high-performance inference, limiting universality. We propose FlashOmni, a unified sparse attention engine compatible with arbitrary DiT architectures. FlashOmni introduces flexible sparse symbols to standardize the representation of a wide range of sparsity strategies, such as feature caching and block-sparse skipping. This unified abstraction enables the execution of diverse sparse computations within a single attention kernel. In addition, FlashOmni designs optimized sparse GEMMs for attention blocks, leveraging sparse symbols to eliminate redundant computations and further improve efficiency. Experiments demonstrate that FlashOmni delivers a near-linear speedup in attention and GEMM-$Q$ that closely matches the sparsity ratio (1:1), and achieves 2.5$\times$-3.8$\times$ acceleration in GEMM-$O$ (peaking at about 87.5% of the theoretical limit). Applied with a multi-granularity sparsity strategy, it enables the Hunyuan model (33K) to achieve about 1.5$\times$ end-to-end acceleration without degrading visual quality.
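A minimal sketch of the block-sparse skipping idea the abstract describes: a per-block boolean mask (standing in for the paper's "sparse symbols") tells the kernel which (query-block, key-block) pairs to compute and which to skip entirely. All names, shapes, and the mask layout are assumptions for illustration, not FlashOmni's actual kernel interface.

```python
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, block=4):
    """Q, K, V: (seq, dim); block_mask: (seq//block, seq//block) bool.

    block_mask[i, j] = True means query block i attends to key block j;
    False blocks are skipped entirely, so work scales with the mask density.
    """
    seq, dim = Q.shape
    nb = seq // block
    out = np.zeros_like(V)
    for i in range(nb):
        qi = Q[i * block:(i + 1) * block]                 # one query block
        cols = [j for j in range(nb) if block_mask[i, j]] # kept key blocks
        if not cols:
            continue                                      # fully skipped row
        idx = np.concatenate(
            [np.arange(j * block, (j + 1) * block) for j in cols])
        scores = qi @ K[idx].T / np.sqrt(dim)
        scores -= scores.max(axis=-1, keepdims=True)      # stable softmax
        p = np.exp(scores)
        p /= p.sum(axis=-1, keepdims=True)
        out[i * block:(i + 1) * block] = p @ V[idx]
    return out
```

With a dense (all-True) mask this reduces to ordinary attention, which is what makes a single kernel able to serve many sparsity strategies.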
Related papers
- Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention [105.11288339285154]
Efficient-LVSM is a dual-stream architecture that applies intra-view self-attention for input views and self-then-cross attention for target views. It achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference speed.
arXiv Detail & Related papers (2026-02-06T08:11:58Z)
- ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding [37.86179431483446]
Autoregressive models (ARMs) are hindered by slow sequential inference. We introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency. ReFusion bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.
arXiv Detail & Related papers (2025-12-15T17:41:19Z)
- Flash Multi-Head Feed-Forward Network [51.82159978122374]
Multi-Head FFN (MH-FFN) is motivated by the structural similarity between single-head attention and FFN. MH-FFN faces two challenges: memory consumption scaling with the head count, and an imbalanced ratio between the growing intermediate size and the fixed head dimension. We propose Flash Multi-Head FFN (FlashMHF), with two key innovations: an I/O-aware fused kernel that computes outputs online, akin to FlashAttention, and a design using dynamically weighted parallel sub-networks.
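A toy sketch of the "dynamically weighted parallel sub-networks" idea in the FlashMHF summary: the FFN is split into H parallel sub-networks whose outputs are mixed by input-dependent gate weights. Shapes, names, and the gating form are assumptions for illustration; the paper's fused I/O-aware kernel is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_ffn(x, W_in, W_out, W_gate):
    """x: (seq, d); W_in: (H, d, d_ff); W_out: (H, d_ff, d); W_gate: (d, H)."""
    gates = softmax(x @ W_gate)                  # (seq, H) per-token head weights
    heads = np.einsum('sd,hdf->hsf', x, W_in)    # H parallel hidden states
    heads = np.maximum(heads, 0.0)               # ReLU nonlinearity
    outs = np.einsum('hsf,hfd->hsd', heads, W_out)
    return np.einsum('sh,hsd->sd', gates, outs)  # gate-weighted sum of heads
```

With H = 1 the gate is identically 1 and this collapses to a plain ReLU FFN, which makes the head count a clean knob on capacity.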
arXiv Detail & Related papers (2025-12-07T20:50:20Z)
- BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination [14.53308613746613]
BitStopper is a fine-grained algorithm-architecture co-design that operates without a sparsity predictor. It achieves 2.03x and 1.89x speedups over Sanger and SOFA, respectively, while delivering 2.4x and 2.1x improvements in energy efficiency.
arXiv Detail & Related papers (2025-12-06T14:44:38Z)
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity [66.94629945519125]
We introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly.
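A sketch of a ReLU + RMSNorm router in the spirit of the BlockFFN summary: RMSNorm keeps the routing scores well-scaled, and ReLU zeroes out experts, giving routing that is both differentiable and naturally sparse. Shapes and names are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """Root-mean-square normalization over the last axis (no learned scale)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def relu_router(x, W_route):
    """x: (tokens, d); W_route: (d, n_experts) -> sparse routing weights.

    Experts with negative normalized scores receive an exact zero weight,
    so their FFN blocks can be skipped entirely at inference time.
    """
    scores = rmsnorm(x @ W_route)   # keep logits on a stable scale
    return np.maximum(scores, 0.0)  # ReLU: inactive experts get exact 0
```

The exact zeros (rather than small softmax probabilities) are what make this routing acceleration-friendly: skipped experts cost nothing.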
arXiv Detail & Related papers (2025-07-11T17:28:56Z)
- Spark Transformer: Reactivating Sparsity in FFN and Attention [63.20677098823873]
We introduce Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both the FFN and the attention mechanism. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
arXiv Detail & Related papers (2025-06-07T03:51:13Z) - Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers [24.105473321347894]
We propose Sparse-vDiT, a sparsity acceleration framework for Video Diffusion Transformer (vDiT)<n>We show that Sparse-vDiT achieves 2.09$times$, 2.38$times$, and 1.67$times$ theoretical FLOP reduction, and actual inference speedups of 1.76$times$, 1.85$times$, and 1.58$times$, respectively.<n>Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
arXiv Detail & Related papers (2025-06-03T16:42:37Z)
- VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers [13.984340807378457]
Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. We design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method. We execute Softmax with 162.7$\times$ less latency and 74.3$\times$ less energy compared to the baseline cluster.
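For context on the Schraudolph method mentioned above, here is a software sketch of the classic trick: exp(x) is approximated by writing a*x + b directly into the high 32 bits of an IEEE-754 double, exploiting the fact that the exponent field acts as a base-2 logarithm. The constants follow Schraudolph (1999); this is an approximation with a few percent relative error, and it is not the paper's Bfloat16 hardware block.

```python
import struct

EXP_A = (1 << 20) / 0.6931471805599453  # 2^20 / ln 2
EXP_B = 1072632447                      # 1023 * 2^20 - 60801 (bias, tuned offset)

def fast_exp(x):
    """Approximate exp(x) via bit manipulation of an IEEE-754 double."""
    hi = int(EXP_A * x + EXP_B)         # becomes the high 32 bits
    # Pack the high word (low word zero) and reinterpret as a double.
    return struct.unpack('<d', struct.pack('<q', hi << 32))[0]
```

The entire evaluation is one multiply, one add, and one bit move, which is why the method maps so naturally onto a small dedicated arithmetic block.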
arXiv Detail & Related papers (2025-04-15T14:28:48Z)
- SPECTRE: An FFT-Based Efficient Drop-In Replacement to Self-Attention for Long Contexts [2.200751835496112]
Long-context transformers face significant efficiency challenges due to the quadratic cost of self-attention. We introduce SPECTRE, a method that replaces each attention head with a fast real FFT. We extend this efficiency to autoregressive generation through our Prefix-FFT cache and enhance local feature representation with an optional wavelet module.
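A toy sketch of FFT-based token mixing as a drop-in for self-attention: each channel is filtered in the frequency domain with a learned complex gain, costing O(n log n) per sequence instead of the O(n^2) of attention. The filter shape and names are assumptions for illustration, not SPECTRE's exact head design.

```python
import numpy as np

def fft_mix(x, freq_gain):
    """x: (seq, dim) real; freq_gain: (seq//2 + 1, dim) complex filter.

    Mixes information across the sequence axis by pointwise multiplication
    in the frequency domain, then transforms back to the token domain.
    """
    Xf = np.fft.rfft(x, axis=0)                       # real FFT over sequence
    return np.fft.irfft(Xf * freq_gain, n=x.shape[0], axis=0)
```

With a unit gain the round trip is the identity, so the learned filter only has to encode the deviation from "no mixing".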
arXiv Detail & Related papers (2025-02-25T17:43:43Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, addresses these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
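A sketch of the butterfly sparsity pattern this entry refers to: an n x n linear map (n a power of two) is factored into log2(n) sparse butterfly factors, each mixing entries in pairs at stride 1, 2, 4, ..., so a matrix-vector product costs O(n log n) instead of O(n^2). The parameter layout here is an assumption for illustration.

```python
import numpy as np

def butterfly_apply(x, factors):
    """x: (n,); factors[s]: (n//2, 2, 2) array of 2x2 mixers at stride 2**s."""
    n = x.size
    for s, F in enumerate(factors):
        stride = 1 << s
        y = x.copy()
        pair = 0
        for start in range(0, n, 2 * stride):     # blocks of 2*stride entries
            for off in range(stride):
                i, j = start + off, start + off + stride
                a, b = x[i], x[j]                 # one butterfly: 2x2 mix
                y[i] = F[pair, 0, 0] * a + F[pair, 0, 1] * b
                y[j] = F[pair, 1, 0] * a + F[pair, 1, 1] * b
                pair += 1
        x = y
    return x
```

Each factor touches every entry exactly once, and there are only log2(n) factors, which is what makes the pattern hardware-friendly: the sparsity structure is fixed and regular rather than data-dependent.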
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.