FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
- URL: http://arxiv.org/abs/2506.01969v2
- Date: Wed, 04 Jun 2025 03:20:26 GMT
- Title: FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
- Authors: Pengcuo Dege, Qiuming Luo, Rui Mao, Chang Kong
- Abstract summary: This paper introduces FlashMLA-ETAP, a novel framework that enhances MLA inference for the single-instance deployment scenario. ETAP reconfigures attention computation through transposition to align the KV context length with the M-dimension in WGMMA operations. FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16), with 5.24x and 4.94x improvements over FlashAttention-3 and FlashInfer, respectively.
- Score: 2.9406946721643092
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient inference of Multi-Head Latent Attention (MLA) is challenging when deploying the DeepSeek-R1 671B model on a single multi-GPU server. This paper introduces FlashMLA-ETAP, a novel framework that enhances MLA inference for the single-instance deployment scenario on NVIDIA H20 GPUs. We propose the Efficient Transpose Attention Pipeline (ETAP), which reconfigures attention computation through transposition to align the KV context length with the \(M\)-dimension in WGMMA operations, significantly reducing redundant computations. FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16), with 5.24x and 4.94x improvements over FlashAttention-3 and FlashInfer, respectively, while maintaining numerical stability with a 15.2x lower RMSE (\(1.25 \times 10^{-5}\)) than FlashAttention-3. Furthermore, ETAP's design enables seamless integration into frameworks like FlashAttention-3 and FlashInfer, supported by a detailed theoretical analysis. Our work addresses a critical gap in resource-constrained inference, offering a scalable solution for mid-tier GPUs and paving the way for broader adoption in hardware-aware optimization. Code is available at https://github.com/pengcuo/FlashMLA-ETAP.
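The transposition at the heart of ETAP is plain matrix algebra: since \(S^\top = (QK^\top)^\top = KQ^\top\), the score matrix can be produced with the long KV dimension on the leading (M) axis of the GEMM instead of the short query dimension. The NumPy sketch below is an illustration of that idea only, with hypothetical sizes and function names rather than the authors' CUDA kernel; it simply checks that the transposed formulation reproduces standard attention.

```python
import numpy as np

# Minimal sketch of the ETAP transposition idea (illustration only, not the
# authors' CUDA kernel; sizes below are hypothetical). In decode, the number of
# queries per step (n_q) is tiny while the KV context length (n_kv) is huge.
# Computing S = Q @ K^T maps n_q onto the WGMMA M-dimension, which gets padded
# up to the tile size; computing S^T = K @ Q^T instead maps the long n_kv onto
# M, so far less of each tile is wasted on padding.
# (The 1/sqrt(d) softmax scaling is omitted for brevity.)

def attention_reference(Q, K, V):
    S = Q @ K.T                                   # (n_q, n_kv): n_q -> M
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def attention_transposed(Q, K, V):
    S_T = K @ Q.T                                 # (n_kv, n_q): n_kv -> M
    S = S_T.T                                     # a real kernel would keep working
                                                  # in the transposed layout
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n_q, n_kv, d = 16, 4096, 64                       # few queries, long KV context
Q = rng.standard_normal((n_q, d))
K = rng.standard_normal((n_kv, d))
V = rng.standard_normal((n_kv, d))
assert np.allclose(attention_reference(Q, K, V), attention_transposed(Q, K, V))
```

Both paths produce the same output; per the abstract, the benefit comes from how the transposed score layout maps onto WGMMA tile shapes on H20, not from the arithmetic itself.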
Related papers
- FlashDMoE: Fast Distributed MoE in a Single Kernel [2.246222223318928]
FlashDMoE is a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. We show that FlashDMoE achieves up to 9x higher GPU utilization, 6x lower latency, 5.7x higher throughput, and 4x better overlap efficiency compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-05T06:29:14Z)
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [9.386969461835433]
FlashInfer is a customizable and efficient attention engine for large language models (LLMs). It tackles KV-cache storage heterogeneity using block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation.
arXiv Detail & Related papers (2025-01-02T02:02:20Z)
- FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression [76.01465333271229]
Multimodal large language models (MLLMs) behave like a sloth in practical use. Recent efforts are devoted to building tiny MLLMs for better efficiency, but the plethora of visual tokens still used limits their actual speedup. In this paper, we propose a powerful and fast tiny MLLM called FlashSloth.
arXiv Detail & Related papers (2024-12-05T16:34:07Z)
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes up to 16-32 can be supported with close to maximum (4x) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision [14.426543629408984]
Attention is the bottleneck for large language models and long-context applications.
We develop three main techniques to speed up attention on GPUs.
We demonstrate that our method, FlashAttention-3, achieves a 1.5-2.0x speedup on H100 GPUs, with FP16 reaching up to 740 TFLOPs/s (75% utilization) and FP8 reaching close to 1.2 PFLOPs/s.
arXiv Detail & Related papers (2024-07-11T15:44:48Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library [0.7366405857677227]
We provide an optimized implementation of the forward pass of FlashAttention-2 as a custom fused kernel targeting NVIDIA Hopper architecture.
We observe 20-50% higher FLOPs/s over a version of FlashAttention-2 optimized for last-generation NVIDIA Ampere architecture.
arXiv Detail & Related papers (2023-12-19T07:56:25Z)
- DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU.
We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training.
It achieves 1.67x and 1.26-1.88x speedups compared to recent Ring Attention and DeepSpeed-Ulysses.
arXiv Detail & Related papers (2023-10-05T03:47:57Z)
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning [11.508362885430133]
We exploit the asymmetric GPU memory hierarchy to bring significant memory saving and runtime speedup.
FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s.
We propose FlashAttention-2, with better work partitioning to address these issues.
arXiv Detail & Related papers (2023-07-17T17:50:36Z)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [80.3586155104237]
FlashAttention is an IO-aware exact attention algorithm for Transformers.
It reduces the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM.
FlashAttention and block-sparse FlashAttention enable longer context in Transformers; a minimal sketch of the tiling idea appears after this entry.
arXiv Detail & Related papers (2022-05-27T17:53:09Z)
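As a companion to the FlashAttention entry above, the NumPy sketch below illustrates the tiling-with-online-softmax idea behind its reduced HBM traffic. It is an illustration only: the block size and function names are arbitrary, and the real kernel operates on tiles held in on-chip SRAM rather than NumPy slices.

```python
import numpy as np

# Minimal sketch of the FlashAttention tiling idea (illustration only): stream
# K/V in blocks so the full score matrix is never materialized, keeping a
# running row-wise max and softmax normalizer so earlier partial results can be
# rescaled as each new block arrives.

def flash_attention_sketch(Q, K, V, block=256):
    n_q = Q.shape[0]
    out = np.zeros((n_q, V.shape[1]))
    m = np.full((n_q, 1), -np.inf)      # running max of scores seen so far
    l = np.zeros((n_q, 1))              # running softmax normalizer
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T                    # scores for this KV block only
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)
        scale = np.exp(m - m_new)       # rescale previously accumulated results
        l = l * scale + P.sum(axis=-1, keepdims=True)
        out = out * scale + P @ Vb
        m = m_new
    return out / l

def naive_attention(Q, K, V):
    S = Q @ K.T
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((64, 64))
K = rng.standard_normal((2048, 64))
V = rng.standard_normal((2048, 64))
assert np.allclose(flash_attention_sketch(Q, K, V), naive_attention(Q, K, V))
```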