Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism
- URL: http://arxiv.org/abs/2510.17896v1
- Date: Sun, 19 Oct 2025 07:07:37 GMT
- Title: Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism
- Authors: Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu,
- Abstract summary: Transformer-based large language models (LLMs) have achieved remarkable success, yet their standard attention mechanism incurs quadratic computation and memory costs with respect to sequence length. Prior work tackles this challenge along two directions: (1) kernel-level optimizations, which accelerate dense and sparse attention operators; and (2) module-level strategies, which scale attention across multiple devices. We propose a unified benchmark that integrates representative attention kernels and context parallel mechanisms with a modular training interface for evaluation.
- Score: 5.744937422147067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based large language models (LLMs) have achieved remarkable success, yet their standard attention mechanism incurs quadratic computation and memory costs with respect to sequence length, posing a major bottleneck for long-context training. Prior work tackles this challenge along two directions: (1) kernel-level optimizations, which accelerate dense and sparse attention operators; and (2) module-level strategies, often referred to as distributed attention or context parallel training, which scale attention across multiple devices. However, systematic evaluation remains limited: operator-level comparisons are often incomplete, while context parallel strategies are typically framework-specific, with unclear performance analysis across contexts. To address these gaps, we propose a unified benchmark that integrates representative attention kernels and context parallel mechanisms with a modular and extensible interface for evaluation. The benchmark evaluates methods along two critical dimensions: (1) attention mask patterns, which strongly affect efficiency, scalability, and usability, and (2) sequence length and distributed scale, which determine performance under extreme long-context training. Through comprehensive experiments on a cluster of up to 96 GPUs, our benchmark enables reproducible comparisons, highlights method-specific trade-offs, and provides practical guidance for designing and deploying attention mechanisms in long-context LLM training.
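The module-level direction can be made concrete with a small single-process simulation of ring-style context parallelism, a common context parallel scheme: the sequence is split into shards, each simulated rank owns one Q/K/V shard, and K/V blocks are merged with an online-softmax update as they "rotate" around the ring. The PyTorch sketch below is a hedged toy under simplifying assumptions (single head, no causal mask, no real communication); it is not the benchmark's code or any particular framework's implementation.

```python
# Toy single-process simulation of ring-style context parallelism
# (illustrative only; single head, no causal mask, no real communication).
import torch

def ring_attention_sim(q, k, v, num_ranks):
    """q, k, v: [seq_len, dim]; returns the same output as dense attention."""
    scale = q.shape[-1] ** -0.5
    q_shards = q.chunk(num_ranks, dim=0)
    k_shards = k.chunk(num_ranks, dim=0)
    v_shards = v.chunk(num_ranks, dim=0)

    outputs = []
    for r in range(num_ranks):                       # each simulated rank
        qi = q_shards[r] * scale
        acc = torch.zeros_like(qi)                   # running weighted sum of V
        denom = torch.zeros(qi.shape[0], 1)          # running softmax normalizer
        row_max = torch.full((qi.shape[0], 1), float("-inf"))
        for step in range(num_ranks):                # K/V block "arriving" this ring step
            j = (r + step) % num_ranks
            scores = qi @ k_shards[j].T              # [q_shard, kv_shard]
            new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
            correction = torch.exp(row_max - new_max)
            p = torch.exp(scores - new_max)
            denom = denom * correction + p.sum(dim=-1, keepdim=True)
            acc = acc * correction + p @ v_shards[j]
            row_max = new_max
        outputs.append(acc / denom)                  # finalize this rank's output shard
    return torch.cat(outputs, dim=0)

# Sanity check against dense attention on a toy size.
q, k, v = (torch.randn(128, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(ring_attention_sim(q, k, v, num_ranks=4), ref, atol=1e-4)
```

In a real context parallel setup the inner loop becomes point-to-point K/V exchange between devices, and the mask pattern (causal, document, sliding window) determines which blocks a rank can skip, which is one of the trade-offs targeted by the benchmark's two evaluation dimensions.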
Related papers
- Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8$\times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z)
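As background for the entry above, the fragment below shows the generic shape of training-free token selection: score all keys, keep only the top-k per query, and mask the rest before the softmax. It is a naive dense sketch with a hypothetical `keep` parameter, not TCA-Attention's actual selection rule; the efficiency of such methods comes precisely from avoiding this dense score matrix.

```python
# Naive sketch of top-k ("informative token") sparse attention; illustrative
# only -- a real sparse kernel never materializes the full score matrix.
import torch

def topk_sparse_attention(q, k, v, keep=256):
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.T) * scale                          # [q_len, kv_len]
    if k.shape[0] > keep:
        kth = scores.topk(keep, dim=-1).values[:, -1:]  # k-th largest score per query
        scores = scores.masked_fill(scores < kth, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v            # attend only to kept tokens
```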
- Efficient Attention Mechanisms for Large Language Models: A Survey [18.86171225316892]
Transformer-based architectures have become the prevailing computation backbone of large language models. Recent research has introduced two principal categories of efficient attention mechanisms. Sparse attention techniques, in contrast, limit attention to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies.
arXiv Detail & Related papers (2025-07-25T18:08:10Z)
- Modality Agnostic Efficient Long Range Encoder [14.705955027331674]
We address the challenge of long-context processing on a single device using generic implementations. To overcome these limitations, we propose MAELRE, a unified and efficient transformer architecture. We demonstrate that MAELRE achieves superior accuracy while reducing computational cost compared to existing long-context models.
arXiv Detail & Related papers (2025-07-25T16:19:47Z)
- Scaling Linear Attention with Sparse State Expansion [62.749291436866606]
The Transformer architecture struggles with long-context scenarios due to quadratic computation and linear memory growth. We propose two key innovations to achieve more effective context compression. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions.
arXiv Detail & Related papers (2025-07-22T13:27:31Z)
- Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models [49.84163262868945]
Large language models have shown remarkable performance across a wide range of language tasks, owing to their exceptional capabilities in context modeling. The most commonly used method of context modeling is full self-attention, as seen in standard decoder-only Transformers. We propose parallel context encoding, which splits the context into sub-pieces and encodes them in parallel.
arXiv Detail & Related papers (2024-12-21T09:04:51Z)
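The mechanism in the entry above can be sketched in a few lines: split the long context into sub-pieces, run the same encoder over each piece independently (so the pieces can be processed in parallel or on different devices), and concatenate the results; what is dropped is cross-piece attention, and the cited analysis studies when that approximation is benign. This is a hedged toy (single attention layer, no positional encoding, hypothetical weight names), not the paper's setup.

```python
# Toy comparison of parallel context encoding vs. full self-attention
# over the whole context (single layer, no positions; hypothetical weights).
import torch

def encode(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv
    return torch.softmax((q @ k.T) * k.shape[-1] ** -0.5, dim=-1) @ v

d = 32
ctx = torch.randn(1024, d)                                    # long context
wq, wk, wv = (torch.randn(d, d) * d ** -0.5 for _ in range(3))

full = encode(ctx, wq, wk, wv)                                # full self-attention
pieces = ctx.chunk(4, dim=0)                                  # split into sub-pieces
parallel = torch.cat([encode(p, wq, wk, wv) for p in pieces], dim=0)

# Any gap comes entirely from the dropped cross-piece attention.
print((full - parallel).abs().mean())
```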
- Core Context Aware Transformers for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling. Our method automatically focuses on and strengthens the core context while diminishing redundancy during the learning process. Our method is able to replace the self-attention module in existing Large Language Models with minimal fine-tuning cost.
arXiv Detail & Related papers (2024-12-17T01:54:08Z)
- S2-Attention: Hardware-Aware Context Sharding Among Attention Heads [49.1454481007861]
Sparse attention selectively attends to a subset of tokens in the context. It remains unclear whether sparse attention can maintain the model's quality at the scale of today's large language models. This paper presents Sparsely-Sharded (S2) Attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels.
arXiv Detail & Related papers (2024-07-25T00:27:07Z)
- Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner. Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
- Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs [39.16152482491236]
Bifurcated attention is a method designed to enhance language model inference in shared-context batch decoding scenarios. Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency at high batch sizes and extended context lengths.
arXiv Detail & Related papers (2024-03-13T16:30:57Z)
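For the entry above, the sketch below illustrates the shared-prefix layout behind bifurcated attention: the prefix KV is stored once for the whole batch, each sample keeps only its own decoded-suffix KV, and the two partial score blocks are joined before a single softmax. This is a hedged PyTorch restatement of the memory layout (single head, no kernel fusion); the paper's contribution is the memory-IO-efficient kernel, not this arithmetic.

```python
# Shared-prefix decoding step with a single prefix KV copy (illustrative
# sketch of the layout only; single head, not the paper's fused kernel).
import torch

def bifurcated_decode_step(q, k_prefix, v_prefix, k_suffix, v_suffix):
    # q: [batch, dim]; k/v_prefix: [prefix_len, dim], shared by every sample;
    # k/v_suffix: [batch, suffix_len, dim], per-sample decoded tokens.
    scale = q.shape[-1] ** -0.5
    s_pre = (q @ k_prefix.T) * scale                          # [batch, prefix_len]
    s_suf = torch.einsum("bd,bld->bl", q, k_suffix) * scale   # [batch, suffix_len]
    probs = torch.softmax(torch.cat([s_pre, s_suf], dim=-1), dim=-1)
    p_pre, p_suf = probs.split([k_prefix.shape[0], k_suffix.shape[1]], dim=-1)
    return p_pre @ v_prefix + torch.einsum("bl,bld->bd", p_suf, v_suffix)

# Matches ordinary attention with the prefix KV replicated per sample.
batch, d, plen, slen = 8, 64, 512, 16
q = torch.randn(batch, d)
kp, vp = torch.randn(plen, d), torch.randn(plen, d)
ks, vs = torch.randn(batch, slen, d), torch.randn(batch, slen, d)
k_full = torch.cat([kp.expand(batch, -1, -1), ks], dim=1)
v_full = torch.cat([vp.expand(batch, -1, -1), vs], dim=1)
w = torch.softmax(torch.einsum("bd,bld->bl", q, k_full) * d ** -0.5, dim=-1)
ref = torch.einsum("bl,bld->bd", w, v_full)
assert torch.allclose(bifurcated_decode_step(q, kp, vp, ks, vs), ref, atol=1e-4)
```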
- Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications [0.8889304968879161]
We run microbenchmarks and realistic proxy applications with the regular compute-communicate structure on two different supercomputing platforms. We show how desynchronization patterns can be readily identified from a data set that is much smaller than a full MPI trace.
arXiv Detail & Related papers (2022-05-27T13:19:07Z)