CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill
- URL: http://arxiv.org/abs/2602.16054v1
- Date: Tue, 17 Feb 2026 22:08:16 GMT
- Title: CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill
- Authors: Bradley McDanel, Steven Li, Harshit Khaitan
- Abstract summary: We introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt.
This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks.
We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline.
- Score: 4.440373965918973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token importance estimation, often varying between layers. Evaluating token-ranking quality independently from heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. The diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline.
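The abstract describes two ideas: an answer-informed oracle that scores prompt tokens by the attention they receive from generated answer tokens, and CLAA, which aggregates per-layer importance scores before ranking instead of trusting any single layer. The following is a minimal NumPy sketch of both ideas on random data; the shapes, the token budget `k`, and the use of a plain mean as the aggregation function are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, prompt_len, answer_len = 4, 16, 3

# Hypothetical per-layer heuristic importance scores for each prompt token,
# e.g. attention mass each token receives during prefill.
layer_scores = rng.random((num_layers, prompt_len))

# A single-layer heuristic ranks tokens using one layer only; per the
# abstract, such rankings can degrade sharply at specific layers.
single_layer_rank = np.argsort(layer_scores[2])[::-1]

# Cross-layer aggregation (the CLAA idea, simplified here to a plain mean):
# average scores over all layers before ranking.
claa_scores = layer_scores.mean(axis=0)
claa_rank = np.argsort(claa_scores)[::-1]

# Answer-informed oracle: ground-truth importance is the attention each
# prompt token receives from the generated answer tokens.
answer_to_prompt = rng.random((answer_len, prompt_len))
answer_to_prompt /= answer_to_prompt.sum(axis=-1, keepdims=True)
oracle_scores = answer_to_prompt.mean(axis=0)

# Token budget: keep only the top-k prompt tokens for the prefill pass.
k = 8
kept = np.sort(claa_rank[:k])
print(kept.tolist())
```

In a real setting the per-layer scores would come from a model's attention maps rather than random draws, and ranking quality would be measured by comparing `claa_rank` against the ordering induced by `oracle_scores`.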
Related papers
- FASA: Frequency-aware Sparse Attention [56.26881872333624]
We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance.
Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head.
Across a spectrum of long-context tasks, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy.
arXiv Detail & Related papers (2026-02-03T06:09:06Z) - A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification [2.0069888187253615]
Production LLM systems often rely on separate models for safety and other classification-heavy steps.
We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation.
arXiv Detail & Related papers (2026-01-19T18:40:29Z) - DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning [6.468843780300177]
We present DELTA, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy.
Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.
arXiv Detail & Related papers (2025-10-10T21:37:49Z) - Intra-request branch orchestration for efficient LLM reasoning [52.68946975865865]
Large Language Models (LLMs) increasingly rely on inference-time reasoning algorithms to improve accuracy on complex tasks.
Prior work has largely focused on reducing token usage, often at the expense of accuracy, while overlooking other latency factors.
We present DUCHESS, an LLM serving system that reduces cost and latency without sacrificing accuracy through intra-request branch orchestration guided by predictions.
arXiv Detail & Related papers (2025-09-29T15:52:08Z) - CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation [7.119276797399788]
Increasing key-value (KV) cache size poses critical challenges to memory and execution efficiency.
Most KV cache compression methods rely on token eviction using all attention heads in Grouped Query Attention (GQA)-based LLMs.
We introduce a layer-adaptive KV cache allocation strategy, which consistently outperforms state-of-the-art approaches under various memory budgets.
arXiv Detail & Related papers (2025-08-04T13:26:16Z) - LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations.
This structured approach allows Large Language Models (LLMs) to generate and execute SQL queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z) - END: Early Noise Dropping for Efficient and Effective Context Denoising [60.24648712022382]
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks.
They are often distracted by irrelevant or noisy context in input sequences that degrades output quality.
We introduce Early Noise Dropping (END), a novel approach to mitigate this issue without requiring fine-tuning the LLMs.
arXiv Detail & Related papers (2025-02-26T08:07:17Z) - SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [53.4441894198495]
Large language models (LLMs) now support extremely long context windows.
The quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency.
We propose SampleAttention, an adaptive structured and near-lossless sparse attention.
arXiv Detail & Related papers (2024-06-17T11:05:15Z) - Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation dataset has limited categories per video.
Fewer than 10% of queries can be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z) - SoQal: Selective Oracle Questioning for Consistency Based Active Learning of Cardiac Signals [17.58391771585294]
Clinical settings are often characterized by abundant unlabelled data and limited labelled data.
One way to mitigate this burden is via active learning (AL) which involves the (a) acquisition and (b) annotation of informative unlabelled instances.
We show that BALC can outperform state-of-the-art acquisition functions such as BALD, and SoQal outperforms baseline methods even in the presence of a noisy oracle.
arXiv Detail & Related papers (2020-04-20T18:20:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.