Causal Attention with Lookahead Keys
- URL: http://arxiv.org/abs/2509.07301v2
- Date: Mon, 29 Sep 2025 17:24:02 GMT
- Title: Causal Attention with Lookahead Keys
- Authors: Zhuoqing Song, Peng Sun, Huizhuo Yuan, Quanquan Gu
- Abstract summary: In standard causal attention, each token's query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token's keys as the context unfolds.
- Score: 52.63961482746826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In standard causal attention, each token's query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token's keys as the context unfolds. We term these updated keys lookahead keys because they belong to earlier positions yet integrate information from tokens that appear later relative to those positions, while strictly preserving the autoregressive property. Although the mechanism appears sequential, we derive a mathematical equivalence that avoids explicitly materializing lookahead keys at each position and enables efficient parallel training. On language modeling benchmarks, CASTLE consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on a range of downstream tasks.
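The mechanism described above can be illustrated with a deliberately naive O(T^2) reference sketch. The update rule below (a mean-pooled summary of the visible prefix mixed into each key via an extra matrix `Wu`) is an illustrative assumption, not the paper's actual parameterization or its efficient parallel form; it only shows how keys at earlier positions can "look ahead" within the visible prefix while the output at step t never touches tokens after t.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def naive_lookahead_attention(X, Wq, Wk, Wv, Wu):
    """Naive sketch of causal attention with lookahead keys.

    At each step t, the key for every earlier position i is refreshed
    with a summary of tokens i..t, so keys integrate information from
    later tokens within the visible prefix. The output for step t never
    reads tokens > t, so the autoregressive property is preserved.
    `Wu` and the mean-pool update are hypothetical choices.
    """
    T, d = X.shape
    Q, V = X @ Wq, X @ Wv
    out = np.zeros((T, d))
    for t in range(T):
        # Rebuild keys for positions 0..t using context visible at step t.
        K = np.stack([
            X[i] @ Wk + X[i:t + 1].mean(axis=0) @ Wu  # lookahead update
            for i in range(t + 1)
        ])
        scores = softmax(Q[t] @ K.T / np.sqrt(d))
        out[t] = scores @ V[:t + 1]
    return out
```

A quick way to check the causal property: perturbing tokens at positions >= s must leave the outputs at positions < s unchanged, even though keys are recomputed at every step.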
Related papers
- Decomposing Query-Key Feature Interactions Using Contrastive Covariances [75.38737409771085]
We study the query-key space -- the bilinear joint embedding space between queries and keys. It is when features in keys and queries align in these low-rank subspaces that high attention scores are produced.
arXiv Detail & Related papers (2026-02-04T16:50:02Z)
- FASA: Frequency-aware Sparse Attention [56.26881872333624]
We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. Across a spectrum of long-context tasks, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy.
arXiv Detail & Related papers (2026-02-03T06:09:06Z)
- CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation [7.119276797399788]
Increasing key-value (KV) cache size poses critical challenges to memory and execution efficiency. Most KV cache compression methods rely on token eviction using all attention heads in Grouped Query Attention (GQA)-based LLMs. We introduce a layer-adaptive KV cache allocation strategy, which consistently outperforms state-of-the-art approaches under various memory budgets.
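Score-based token eviction, the family of methods this paper builds on, can be sketched in a few lines. This is a generic toy (not CompressKV's semantic-retrieval-head method): it keeps the `budget` cached tokens with the largest accumulated attention mass and drops the rest.

```python
import numpy as np

def evict_kv(K, V, attn_mass, budget):
    """Toy score-based KV-cache eviction (a generic sketch, not
    CompressKV itself). `attn_mass[i]` is the accumulated attention
    each cached token i has received; keep the top-`budget` tokens."""
    keep = np.sort(np.argsort(attn_mass)[-budget:])  # sort restores positional order
    return K[keep], V[keep], keep
```

Real eviction schemes differ mainly in how `attn_mass` is computed (which heads vote, how scores are accumulated) and, as in this paper, how `budget` is allocated per layer.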
arXiv Detail & Related papers (2025-08-04T13:26:16Z)
- Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%.
arXiv Detail & Related papers (2025-06-23T18:02:26Z)
- Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. LRMs need to generate long chain-of-thought reasoning in order to think before answering. We introduce Multipole Attention, which accelerates autoregressive reasoning by computing exact attention only for the most important tokens.
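The core idea of "exact attention only for the most important tokens" can be sketched as a single top-k decoding step. This is a generic top-k sparse attention sketch, not Multipole Attention itself (which also approximates the contribution of the remaining tokens): score all cached keys, then attend exactly over only the k highest-scoring ones.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Generic top-k sparse decoding step (an illustrative sketch).

    q: (d,) query for the current step; K, V: (T, d) cached keys/values.
    Only the k tokens with the highest scores receive exact attention;
    the rest are ignored entirely.
    """
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)               # (T,) dot-product scores
    idx = np.argpartition(scores, -k)[-k:]    # indices of top-k tokens
    e = np.exp(scores[idx] - scores[idx].max())
    w = e / e.sum()                           # softmax over top-k only
    return w @ V[idx]
```

With k equal to the full cache length this reduces exactly to standard attention, which makes the approximation easy to sanity-check.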
arXiv Detail & Related papers (2025-06-16T03:00:40Z)
- AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models [14.013793473739236]
We propose Adaptive holistic attention KV (AhaKV) to address the bias of the accumulated attention score. AhaKV successfully mitigates this bias and retains crucial tokens across the global context.
arXiv Detail & Related papers (2025-06-04T09:25:53Z)
- Inference-time sparse attention with asymmetric indexing [23.305984099821618]
Self-attention in transformer models is an incremental associative memory that maps key vectors to value vectors. One way to speed up self-attention is to employ GPU-compatible vector search algorithms based on standard partitioning methods such as k-means. This paper introduces Saap, an asymmetric indexing technique that employs distinct partitions for keys and queries, thereby approximating self-attention with a data-adaptive sparsity pattern.
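The partition-based retrieval that Saap builds on can be sketched with a plain (symmetric) k-means index; Saap's actual contribution is training distinct partitions for keys and queries, which this toy deliberately omits. The sketch routes the query to its `n_probe` nearest centroids and computes attention only over keys assigned to those partitions.

```python
import numpy as np

def partitioned_attention(q, K, V, centroids, assign, n_probe):
    """Symmetric k-means-partitioned sparse attention (a toy sketch,
    not Saap's asymmetric indexing).

    q: (d,) query; K, V: (T, d) keys/values; centroids: (C, d);
    assign[i]: partition index of key i; n_probe: partitions to visit
    (must select at least one key).
    """
    d = q.shape[-1]
    dists = ((centroids - q) ** 2).sum(axis=1)
    probe = np.argsort(dists)[:n_probe]      # partitions nearest to q
    mask = np.isin(assign, probe)            # keys in probed partitions
    scores = K[mask] @ q / np.sqrt(d)
    e = np.exp(scores - scores.max())
    return (e / e.sum()) @ V[mask]
```

Probing all partitions recovers exact attention; the asymmetry argument in the paper is that queries and keys are distributed differently, so sharing one partitioning between them (as this sketch does) is suboptimal.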
arXiv Detail & Related papers (2025-02-12T09:39:54Z)
- AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference [51.1972443343829]
We propose AttentionPredictor, the first learning-based critical-token identification approach. AttentionPredictor accurately predicts attention scores while consuming negligible memory. We also propose a cross-token critical cache prefetching framework that hides the token estimation overhead to accelerate the decoding stage.
arXiv Detail & Related papers (2025-02-06T13:41:46Z)
- ZETA: Leveraging Z-order Curves for Efficient Top-k Attention [22.90397380324185]
We propose ZETA to enable parallel querying of past tokens for entire sequences. ZETA matches the performance of standard attention on the synthetic Multi-Query Associative Recall task.
arXiv Detail & Related papers (2025-01-24T15:33:05Z)
- Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens [57.37893387775829]
We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC). SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner. We propose a versatile vision backbone, SECViT, to serve as a vision-language connector.
arXiv Detail & Related papers (2024-05-22T04:49:00Z)
- Representation Learning of Tangled Key-Value Sequence Data for Early Classification [19.943311002522154]
Key-value sequence data has become ubiquitous and naturally appears in a variety of real-world applications.
Classifying these key-value sequences is important in many scenarios such as user profiling and malicious application identification.
In many time-sensitive scenarios, besides the requirement of classifying a key-value sequence accurately, it is also desired to classify a key-value sequence early.
arXiv Detail & Related papers (2024-04-11T03:23:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.