Optimizing Mixture of Block Attention
- URL: http://arxiv.org/abs/2511.11571v1
- Date: Fri, 14 Nov 2025 18:59:59 GMT
- Title: Optimizing Mixture of Block Attention
- Authors: Guangxuan Xiao, Junxian Guo, Kasra Mazaheri, Song Han
- Abstract summary: We develop a statistical model to analyze MoBA's underlying mechanics. We identify two key pathways for improvement: using smaller block sizes and applying a short convolution on keys to cluster relevant signals. We introduce FlashMoBA, a hardware-aware kernel that enables efficient MoBA execution even with the small block sizes our theory recommends.
- Score: 12.276306440688137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture of Block Attention (MoBA) (Lu et al., 2025) is a promising building block for efficiently processing long contexts in LLMs by enabling queries to sparsely attend to a small subset of key-value blocks, drastically reducing computational cost. However, the design principles governing MoBA's performance are poorly understood, and it lacks an efficient GPU implementation, hindering its practical adoption. In this paper, we first develop a statistical model to analyze MoBA's underlying mechanics. Our model reveals that performance critically depends on the router's ability to accurately distinguish relevant from irrelevant blocks based on query-key affinities. We derive a signal-to-noise ratio that formally connects architectural parameters to this retrieval accuracy. Guided by our analysis, we identify two key pathways for improvement: using smaller block sizes and applying a short convolution on keys to cluster relevant signals, which enhances routing accuracy. While theoretically better, small block sizes are inefficient on GPUs. To bridge this gap, we introduce FlashMoBA, a hardware-aware CUDA kernel that enables efficient MoBA execution even with the small block sizes our theory recommends. We validate our insights by training LLMs from scratch, showing that our improved MoBA models match the performance of dense attention baselines. FlashMoBA achieves up to 14.7x speedup over FlashAttention-2 for small blocks, making our theoretically-grounded improvements practical. Code is available at: https://github.com/mit-han-lab/flash-moba.
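The block-routing mechanism the abstract describes (queries score key-value blocks via a router, then attend exactly within only the top-scoring blocks) can be illustrated with a short, self-contained NumPy sketch. This is an illustrative stand-in, not the paper's implementation: the mean-key block summary, the function names, and the simple causal moving average used in place of the learned short key convolution are all assumptions.

```python
import numpy as np

def moba_block_attention(q, K, V, block_size, top_k):
    """MoBA-style sparse attention for a single query vector (sketch).

    The router scores each key block by the affinity between the query
    and the block's mean key, keeps the top_k blocks, and runs exact
    softmax attention over the keys in those blocks only.
    Assumes len(K) is divisible by block_size.
    """
    n, d = K.shape
    n_blocks = n // block_size
    # Block summary: mean key per block acts as the router's centroid.
    centroids = K.reshape(n_blocks, block_size, d).mean(axis=1)
    # Router: score every block, keep the top_k highest-affinity blocks.
    scores = centroids @ q                        # (n_blocks,)
    chosen = np.argsort(scores)[-top_k:]          # indices of selected blocks
    # Gather the keys/values of the selected blocks and attend exactly.
    idx = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in chosen])
    logits = K[idx] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())             # numerically stable softmax
    w /= w.sum()
    return w @ V[idx]

def short_key_conv(K, kernel=3):
    """Stand-in for the paper's short convolution on keys: a causal
    moving average that pools nearby key signals before routing."""
    pad = np.vstack([np.zeros((kernel - 1, K.shape[1])), K])
    return np.stack([pad[i:i + kernel].mean(axis=0)
                     for i in range(K.shape[0])])
```

A useful sanity check on the sketch: when `top_k` equals the total number of blocks, the routed attention must reproduce dense attention exactly, since every key-value pair is still visited.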
Related papers
- MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs [9.086910335841772]
"Memory wall" bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures.<n>We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach.<n>We demonstrate that MoEBlaze can achieve over 4x speedups and over 50% memory savings compared to existing MoE frameworks.
arXiv Detail & Related papers (2026-01-08T08:38:23Z)
- SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations [54.303301888915406]
Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. We propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching. We also propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels.
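The padding waste that "token rounding" targets comes from simple tile arithmetic: grouped GEMM kernels process each expert's tokens in fixed-size tiles, so every expert's row count is rounded up to a tile multiple. The sketch below shows only this motivating arithmetic; the function name and tile size are assumptions, and SonicMoE's actual rounding scheme is not reproduced here.

```python
def grouped_gemm_padding_waste(tokens_per_expert, tile=128):
    """Count the wasted (padded) rows when each expert's token count
    is rounded up to a multiple of the GEMM tile size.

    Illustrative arithmetic only; not SonicMoE's API.
    """
    padded = [-(-c // tile) * tile for c in tokens_per_expert]  # ceil to tile
    return sum(padded) - sum(tokens_per_expert)                 # wasted rows
```

For example, expert loads of 100, 130, and 256 tokens with 128-wide tiles pad up to 128, 256, and 256 rows, wasting 154 rows of compute; loads that are already tile multiples waste nothing.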
arXiv Detail & Related papers (2025-12-16T04:39:10Z)
- Block Sparse Flash Attention [29.499030734003952]
Block-Sparse FlashAttention is a drop-in replacement for FlashAttention. It computes exact query-key similarities to select the top-k most important value blocks for each query. It achieves up to 1.10x speedup on real-world reasoning benchmarks and up to 1.24x on needle-in-a-haystack retrieval tasks.
arXiv Detail & Related papers (2025-12-07T21:20:12Z)
- Fast-dLLM v2: Efficient Block-Diffusion LLM [64.38006546510337]
Fast-dLLM v2 is a block diffusion language model that adapts pretrained AR models into dLLMs for parallel text generation. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens).
arXiv Detail & Related papers (2025-09-30T14:40:18Z)
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity [66.94629945519125]
We introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly.
arXiv Detail & Related papers (2025-07-11T17:28:56Z)
- FlashMoE: Fast Distributed MoE in a Single Kernel [1.866526462692252]
FlashMoE is a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. We show that FlashMoE achieves up to 9x higher GPU utilization, 6x lower latency, 5.7x higher throughput, and 4x better overlap efficiency compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-05T06:29:14Z)
- MoBA: Mixture of Block Attention for Long-Context LLMs [46.10222520755179]
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). Existing approaches either impose strongly biased structures, such as sink or window attention, which are task-specific, or radically modify the attention mechanism into linear approximations. We propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously.
arXiv Detail & Related papers (2025-02-18T14:06:05Z)
- LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones [10.435069781620957]
Research in efficient vision backbones is evolving into models that are a mixture of convolutions and transformer blocks.
We analyze common modules and architectural design choices for backbones not in terms of MACs, but rather in actual throughput and latency.
We combine both macro and micro design to create a new family of hardware-efficient backbone networks called LowFormer.
arXiv Detail & Related papers (2024-09-05T12:18:32Z)
- Mamba YOLO: A Simple Baseline for Object Detection with State Space Model [10.44725284994877]
The YOLO series has set a new benchmark for real-time object detectors. Transformer-based structures have emerged as the most powerful solution. However, the quadratic complexity of the self-attention mechanism increases the computational burden. We introduce a simple yet effective baseline approach called Mamba YOLO.
arXiv Detail & Related papers (2024-06-09T15:56:19Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity [12.663030430488922]
We propose Flash-LLM for enabling low-cost and highly efficient large generative model inference on high-performance Tensor Cores.
At the SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art libraries Sputnik and SparTA by an average of 2.9x and 1.5x, respectively.
arXiv Detail & Related papers (2023-09-19T03:20:02Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.