HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference
- URL: http://arxiv.org/abs/2602.00777v1
- Date: Sat, 31 Jan 2026 15:36:17 GMT
- Title: HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference
- Authors: Xuan Ai, Qingqing Yang, Peng Wang, Lei Deng, Lin Zhang, Renhai Chen, Gong Zhang
- Abstract summary: Long-context inference in Large Language Models is bottlenecked by the quadratic computation complexity of attention. We introduce HyLRA, a novel framework driven by layer-wise sparsity profiling. We show that HyLRA improves inference throughput by 6%--46% while maintaining comparable performance.
- Score: 11.718567830546538
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-context inference in Large Language Models (LLMs) is bottlenecked by the quadratic computation complexity of attention and the substantial memory footprint of Key-Value (KV) caches. While existing sparse attention mechanisms attempt to mitigate this by exploiting inherent sparsity, they often rely on rigid patterns or aggressive pruning, failing to achieve an optimal balance between efficiency and accuracy. In this paper, we introduce {\bf HyLRA} ({\bf Hy}brid {\bf L}ayer {\bf R}euse {\bf A}ttention), a novel framework driven by layer-wise sparsity profiling. Our empirical analysis uncovers a dual characteristic in attention mechanics: \textit{intra-layer sensitivity}, where specific layers necessitate full attention to prevent feature distortion, and \textit{inter-layer similarity}, where consecutive layers share substantial critical tokens. Based on these observations, HyLRA employs an offline dynamic programming approach to derive an optimal layer-wise policy. This hybrid strategy retains full attention for sensitive layers to ensure robustness, while enabling tolerant layers to bypass quadratic calculations by directly reusing top-$k$ indices from preceding layers. This approach allows LLMs to restrict computation to the most critical tokens, effectively overcoming the quadratic bottleneck of dense attention. Extensive evaluations demonstrate that HyLRA improves inference throughput by 6\%--46\% while maintaining comparable performance (with $<1\%$ accuracy degradation), consistently outperforming state-of-the-art sparse attention methods. HyLRA is open source at \href{https://anonymous.4open.science/r/unified-cache-management-CF80/}{\texttt{/r/unified-cache-management-CF80/}}
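The core mechanism described in the abstract (sensitive layers compute full attention and exact top-$k$ indices; tolerant layers skip the quadratic scoring pass by reusing the previous layer's indices) can be sketched as follows. The policy encoding, function name, and single-query single-head shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def attention_with_index_reuse(Q, K, V, policy, k=4):
    """Per-layer attention for one query position and one head.
    'full' layers score all keys and record exact top-k indices;
    'reuse' layers borrow the indices from the preceding layer and
    attend only over those k tokens."""
    L, d = Q.shape[0], Q.shape[-1]          # number of layers, head dim
    prev_idx = None
    outputs = []
    for layer in range(L):
        q, Kl, Vl = Q[layer], K[layer], V[layer]
        if policy[layer] == "full" or prev_idx is None:
            scores = Kl @ q / np.sqrt(d)    # full scoring pass over all n keys
            prev_idx = np.argsort(scores)[-k:]
        idx = prev_idx                      # reuse layers skip the scoring pass
        s = (Kl[idx] @ q) / np.sqrt(d)      # attend only over the k kept tokens
        w = np.exp(s - s.max())
        w /= w.sum()                        # softmax restricted to kept tokens
        outputs.append(w @ Vl[idx])
    return np.stack(outputs)
```

In the paper's framing, which layers get `"full"` versus `"reuse"` is decided offline by dynamic programming over the layer-wise sparsity profile; the hard-coded policy list here merely stands in for that output.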
Related papers
- Training Large Reasoning Models Efficiently via Progressive Thought Encoding [63.254758972725654]
Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency. We introduce Progressive Thought, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches.
arXiv Detail & Related papers (2026-02-18T20:03:38Z) - Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping [61.459927600301654]
Multi-condition control is bottlenecked by the conventional "concatenate-and-attend" strategy. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. We propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies.
arXiv Detail & Related papers (2026-02-06T16:39:10Z) - RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference [13.524332723947703]
We present RRAttention, a novel dynamic sparse attention method. It simultaneously achieves all desirable properties through a per-head round-robin (RR) sampling strategy. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$k$ selection for optimal sparsity.
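The stated $O(L^2) \to O(L^2/S^2)$ reduction follows from each head sampling only every $S$-th block, with the stride offset rotated round-robin across heads so that all blocks remain covered collectively. A minimal sketch of such a block assignment (block granularity and naming are assumptions for illustration, not the paper's implementation):

```python
def round_robin_block_assignment(n_blocks, n_heads, stride):
    """Assign each head the key blocks at its round-robin offset.
    Each head touches only ~n_blocks/stride blocks (so pairwise block
    work per head shrinks by a factor of stride^2 when both query and
    key sides are strided), while every block is still covered by some
    head whenever n_heads >= stride."""
    return {h: list(range(h % stride, n_blocks, stride))
            for h in range(n_heads)}
```

For example, with 8 blocks, 4 heads, and stride 4, head 0 samples blocks {0, 4}, head 1 samples {1, 5}, and so on, so the union of all heads still spans every block.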
arXiv Detail & Related papers (2026-02-05T16:37:41Z) - Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model [21.206033754351786]
Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens. Existing approaches focus on token-wise optimization, leveraging diverse token pruning techniques to eliminate non-crucial visual tokens. We propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns.
arXiv Detail & Related papers (2026-02-02T10:08:00Z) - Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference [9.469995152350899]
We propose Kascade, a training-free sparse attention method that leverages known observations. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over a FlashAttention-3 baseline on H100 GPUs.
arXiv Detail & Related papers (2025-12-18T10:37:14Z) - PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation [34.8993443618652]
We present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget.
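The multi-level pooled KV idea summarized above can be illustrated with a simple mean-pooling pyramid over a key matrix: full resolution at the finest level, progressively coarser (and cheaper) pooled copies above it. The pool sizes and naming here are assumptions for illustration, not PSA's actual design.

```python
import numpy as np

def kv_pyramid(K, levels=(1, 2, 4)):
    """Build multi-level pooled copies of a key matrix K of shape (n, d).
    Level 1 keeps full resolution; level p mean-pools every p consecutive
    keys into one coarse key, giving n//p representatives."""
    n, d = K.shape
    pyramid = {}
    for p in levels:
        m = n // p
        # Truncate to a multiple of p, group into m pools of p keys, average.
        pyramid[p] = K[: m * p].reshape(m, p, d).mean(axis=1)
    return pyramid
```

An attention kernel can then score low-importance regions against the coarse levels and reserve the full-resolution level for critical tokens, rather than applying an all-or-nothing binary mask.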
arXiv Detail & Related papers (2025-12-03T18:02:11Z) - Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity [0.0]
The Key-Value (KV) cache is integral to efficient autoregressive inference in large language models (LLMs). This paper examines the interplay between KV cache management strategies and the architectural context limits of models like meta-llama/Meta-Llama-3-8b-instruct.
arXiv Detail & Related papers (2025-10-23T18:22:00Z) - Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [72.27673320976933]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding. Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention.
arXiv Detail & Related papers (2025-08-04T16:14:03Z) - EARN: Efficient Inference Acceleration for LLM-based Generative Recommendation by Register Tokens [47.60523011706102]
Large Language Model-based generative recommendation (LLMRec) has achieved notable success, but it suffers from high inference latency. We propose EARN, an efficient inference framework that leverages the early layers to compress information into register tokens placed at the input sequence boundaries.
arXiv Detail & Related papers (2025-07-01T12:42:06Z) - On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention [53.22963042513293]
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. We first propose dual-state linear attention (DSLA), a novel design that maintains two hidden states, one for preserving historical context and one for tracking recency, thereby mitigating the short-range bias typical of linear-attention architectures. We introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering.
arXiv Detail & Related papers (2025-06-11T01:25:06Z) - PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention [73.26995918610669]
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. We introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by 5%--40%.
arXiv Detail & Related papers (2025-03-05T15:24:11Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference. We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence. Our method achieves state-of-the-art performance compared with others.
arXiv Detail & Related papers (2024-12-04T15:48:59Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundancy caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.