FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers
- URL: http://arxiv.org/abs/2509.25401v1
- Date: Mon, 29 Sep 2025 18:57:14 GMT
- Title: FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers
- Authors: Liang Qiao, Yue Dai, Yeqi Huang, Hongyu Kan, Jun Shi, Hong An,
- Abstract summary: FlashOmni is a unified sparse attention engine compatible with arbitrary DiT architectures. It delivers a near-linear speedup, closely matching the sparsity ratio, in attention and GEMM-$Q$, and achieves 2.5$\times$-3.8$\times$ acceleration in GEMM-$O$.
- Score: 7.026182341295719
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-Modal Diffusion Transformers (DiTs) demonstrate exceptional capabilities in visual synthesis, yet their deployment remains constrained by substantial computational demands. To alleviate this bottleneck, many sparsity-based acceleration methods have been proposed. However, their diverse sparsity patterns often require customized kernels for high-performance inference, limiting universality. We propose FlashOmni, a unified sparse attention engine compatible with arbitrary DiT architectures. FlashOmni introduces flexible sparse symbols to standardize the representation of a wide range of sparsity strategies, such as feature caching and block-sparse skipping. This unified abstraction enables the execution of diverse sparse computations within a single attention kernel. In addition, FlashOmni designs optimized sparse GEMMs for attention blocks, leveraging sparse symbols to eliminate redundant computations and further improve efficiency. Experiments demonstrate that FlashOmni delivers a near-linear speedup in attention and GEMM-$Q$ that closely matches the sparsity ratio (1:1), and achieves 2.5$\times$-3.8$\times$ acceleration in GEMM-$O$ (peaking at about 87.5% of the theoretical limit). Applied with a multi-granularity sparsity strategy, it enables the Hunyuan model (33K) to achieve about 1.5$\times$ end-to-end acceleration without degrading visual quality.
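A minimal sketch of the block-sparse skipping idea the abstract describes: a per-block boolean mask (standing in for the paper's "sparse symbols") tells the kernel which (query-block, key-block) pairs to compute and which to skip entirely. All names, shapes, and the mask layout are assumptions for illustration, not FlashOmni's actual kernel interface.

```python
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, block=4):
    """Q, K, V: (seq, dim); block_mask: (seq//block, seq//block) bool.

    block_mask[i, j] = True means query block i attends to key block j;
    False blocks are skipped entirely, so work scales with the mask density.
    """
    seq, dim = Q.shape
    nb = seq // block
    out = np.zeros_like(V)
    for i in range(nb):
        qi = Q[i * block:(i + 1) * block]                 # one query block
        cols = [j for j in range(nb) if block_mask[i, j]] # kept key blocks
        if not cols:
            continue                                      # fully skipped row
        idx = np.concatenate(
            [np.arange(j * block, (j + 1) * block) for j in cols])
        scores = qi @ K[idx].T / np.sqrt(dim)
        scores -= scores.max(axis=-1, keepdims=True)      # stable softmax
        p = np.exp(scores)
        p /= p.sum(axis=-1, keepdims=True)
        out[i * block:(i + 1) * block] = p @ V[idx]
    return out
```

With a dense (all-True) mask this reduces to ordinary attention, which is what makes a single kernel able to serve many sparsity strategies.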
Related papers
- Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention [105.11288339285154]
Efficient-LVSM is a dual-stream architecture that applies intra-view self-attention for input views and self-then-cross attention for target views. It achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference speed.
arXiv Detail & Related papers (2026-02-06T08:11:58Z)
- ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding [37.86179431483446]
Autoregressive models (ARMs) are hindered by slow sequential inference. We introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency. ReFusion bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.
arXiv Detail & Related papers (2025-12-15T17:41:19Z)
- Flash Multi-Head Feed-Forward Network [51.82159978122374]
Multi-Head FFN (MH-FFN) is motivated by the structural similarity between single-head attention and FFN. MH-FFN faces two challenges: memory consumption scaling with the head count, and an imbalanced ratio between the growing intermediate size and the fixed head dimension. We propose Flash Multi-Head FFN (FlashMHF), with two key innovations: an I/O-aware fused kernel that computes outputs online, akin to FlashAttention, and a design using dynamically weighted parallel sub-networks.
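A toy sketch of the "dynamically weighted parallel sub-networks" idea in the FlashMHF summary: the FFN is split into H parallel sub-networks whose outputs are mixed by input-dependent gate weights. Shapes, names, and the gating form are assumptions for illustration; the paper's fused I/O-aware kernel is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_ffn(x, W_in, W_out, W_gate):
    """x: (seq, d); W_in: (H, d, d_ff); W_out: (H, d_ff, d); W_gate: (d, H)."""
    gates = softmax(x @ W_gate)                  # (seq, H) per-token head weights
    heads = np.einsum('sd,hdf->hsf', x, W_in)    # H parallel hidden states
    heads = np.maximum(heads, 0.0)               # ReLU nonlinearity
    outs = np.einsum('hsf,hfd->hsd', heads, W_out)
    return np.einsum('sh,hsd->sd', gates, outs)  # gate-weighted sum of heads
```

With H = 1 the gate is identically 1 and this collapses to a plain ReLU FFN, which makes the head count a clean knob on capacity.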
arXiv Detail & Related papers (2025-12-07T20:50:20Z)
- BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination [14.53308613746613]
BitStopper is a fine-grained algorithm-architecture co-design that operates without a sparsity predictor. It achieves 2.03x and 1.89x speedups over Sanger and SOFA, respectively, while delivering 2.4x and 2.1x improvements in energy efficiency.
arXiv Detail & Related papers (2025-12-06T14:44:38Z)
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity [66.94629945519125]
We introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly.
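A sketch of a ReLU + RMSNorm router in the spirit of the BlockFFN summary: RMSNorm keeps the routing scores well-scaled, and ReLU zeroes out experts, giving routing that is both differentiable and naturally sparse. Shapes and names are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """Root-mean-square normalization over the last axis (no learned scale)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def relu_router(x, W_route):
    """x: (tokens, d); W_route: (d, n_experts) -> sparse routing weights.

    Experts with negative normalized scores receive an exact zero weight,
    so their FFN blocks can be skipped entirely at inference time.
    """
    scores = rmsnorm(x @ W_route)   # keep logits on a stable scale
    return np.maximum(scores, 0.0)  # ReLU: inactive experts get exact 0
```

The exact zeros (rather than small softmax probabilities) are what make this routing acceleration-friendly: skipped experts cost nothing.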
arXiv Detail & Related papers (2025-07-11T17:28:56Z)
- Spark Transformer: Reactivating Sparsity in FFN and Attention [63.20677098823873]
We introduce Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both the FFN and the attention mechanism. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
arXiv Detail & Related papers (2025-06-07T03:51:13Z) - Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers [24.105473321347894]
We propose Sparse-vDiT, a sparsity acceleration framework for Video Diffusion Transformer (vDiT)<n>We show that Sparse-vDiT achieves 2.09$times$, 2.38$times$, and 1.67$times$ theoretical FLOP reduction, and actual inference speedups of 1.76$times$, 1.85$times$, and 1.58$times$, respectively.<n>Our work demonstrates that latent structural sparsity in vDiTs can be systematically exploited for long video synthesis.
arXiv Detail & Related papers (2025-06-03T16:42:37Z)
- VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers [13.984340807378457]
Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. We design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method. We execute Softmax with 162.7$\times$ less latency and 74.3$\times$ less energy compared to the baseline cluster.
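For context on the Schraudolph method mentioned above, here is a software sketch of the classic trick: exp(x) is approximated by writing a*x + b directly into the high 32 bits of an IEEE-754 double, exploiting the fact that the exponent field acts as a base-2 logarithm. The constants follow Schraudolph (1999); this is an approximation with a few percent relative error, and it is not the paper's Bfloat16 hardware block.

```python
import struct

EXP_A = (1 << 20) / 0.6931471805599453  # 2^20 / ln 2
EXP_B = 1072632447                      # 1023 * 2^20 - 60801 (bias, tuned offset)

def fast_exp(x):
    """Approximate exp(x) via bit manipulation of an IEEE-754 double."""
    hi = int(EXP_A * x + EXP_B)         # becomes the high 32 bits
    # Pack the high word (low word zero) and reinterpret as a double.
    return struct.unpack('<d', struct.pack('<q', hi << 32))[0]
```

The entire evaluation is one multiply, one add, and one bit move, which is why the method maps so naturally onto a small dedicated arithmetic block.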
arXiv Detail & Related papers (2025-04-15T14:28:48Z)
- SPECTRE: An FFT-Based Efficient Drop-In Replacement to Self-Attention for Long Contexts [2.200751835496112]
Long-context transformers face significant efficiency challenges due to the quadratic cost of self-attention. We introduce SPECTRE, a method that replaces each attention head with a fast real FFT. We extend this efficiency to autoregressive generation through our Prefix-FFT cache and enhance local feature representation with an optional wavelet module.
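A toy sketch of FFT-based token mixing as a drop-in for self-attention: each channel is filtered in the frequency domain with a learned complex gain, costing O(n log n) per sequence instead of the O(n^2) of attention. The filter shape and names are assumptions for illustration, not SPECTRE's exact head design.

```python
import numpy as np

def fft_mix(x, freq_gain):
    """x: (seq, dim) real; freq_gain: (seq//2 + 1, dim) complex filter.

    Mixes information across the sequence axis by pointwise multiplication
    in the frequency domain, then transforms back to the token domain.
    """
    Xf = np.fft.rfft(x, axis=0)                       # real FFT over sequence
    return np.fft.irfft(Xf * freq_gain, n=x.shape[0], axis=0)
```

With a unit gain the round trip is the identity, so the learned filter only has to encode the deviation from "no mixing".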
arXiv Detail & Related papers (2025-02-25T17:43:43Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, addresses these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
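A sketch of the butterfly sparsity pattern this entry refers to: an n x n linear map (n a power of two) is factored into log2(n) sparse butterfly factors, each mixing entries in pairs at stride 1, 2, 4, ..., so a matrix-vector product costs O(n log n) instead of O(n^2). The parameter layout here is an assumption for illustration.

```python
import numpy as np

def butterfly_apply(x, factors):
    """x: (n,); factors[s]: (n//2, 2, 2) array of 2x2 mixers at stride 2**s."""
    n = x.size
    for s, F in enumerate(factors):
        stride = 1 << s
        y = x.copy()
        pair = 0
        for start in range(0, n, 2 * stride):     # blocks of 2*stride entries
            for off in range(stride):
                i, j = start + off, start + off + stride
                a, b = x[i], x[j]                 # one butterfly: 2x2 mix
                y[i] = F[pair, 0, 0] * a + F[pair, 0, 1] * b
                y[j] = F[pair, 1, 0] * a + F[pair, 1, 1] * b
                pair += 1
        x = y
    return x
```

Each factor touches every entry exactly once, and there are only log2(n) factors, which is what makes the pattern hardware-friendly: the sparsity structure is fixed and regular rather than data-dependent.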
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.