Related papers: LUCID: Attention with Preconditioned Representations

URL: http://arxiv.org/abs/2602.10410v1
Date: Wed, 11 Feb 2026 01:46:32 GMT
Title: LUCID: Attention with Preconditioned Representations
Authors: Sai Surya Duvvuri, Nirmal Patel, Nilesh Gupta, Inderjit S. Dhillon
Abstract summary: We introduce LUCID Attention, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space. We validate our approach by training 1 billion parameter language models evaluated on up to 128K tokens.
Score: 14.98859684869003
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Softmax-based dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass to irrelevant tokens, degrading performance in long-sequence scenarios. Furthermore, attempts to sharpen focus by lowering the softmax temperature hinder learnability due to vanishing gradients. We introduce LUCID Attention, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space, allowing the query to focus accurately on the important keys among a large number of keys, with the same computational complexity as standard attention. Additionally, LUCID's preconditioning-based approach to retrieval bypasses the need for low temperature and the learnability problems associated with it. We validate our approach by training ~1 billion parameter language models evaluated on up to 128K tokens. Our results demonstrate significant gains on long-context retrieval tasks, specifically retrieval tasks from BABILong, RULER, SCROLLS and LongBench. For instance, LUCID achieves up to 18% improvement in BABILong and 14% improvement in RULER multi-needle performance compared to standard attention.
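
The two softmax limitations named in the abstract (probability mass diffusing as the context grows, and vanishing gradients at low temperature) can be illustrated with a toy calculation. The sketch below is not LUCID's preconditioner; it only reproduces the motivating effect under assumed values: one relevant key with a logit of 5.0, all other keys at logit 0.0, and the hypothetical helper names `mass_on_relevant` and `self_gradient`.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mass_on_relevant(n_keys, gap=5.0, temperature=1.0):
    """Attention mass on one relevant key (logit = gap, an assumed toy value)
    competing with n_keys - 1 irrelevant keys (logit = 0)."""
    logits = [gap] + [0.0] * (n_keys - 1)
    return softmax(logits, temperature)[0]


def self_gradient(n_keys, gap=5.0, temperature=1.0):
    """Derivative of the relevant key's probability with respect to its own
    logit: p * (1 - p) / temperature. It shrinks toward zero as p saturates."""
    p = mass_on_relevant(n_keys, gap, temperature)
    return p * (1.0 - p) / temperature


if __name__ == "__main__":
    # Mass on the relevant key falls as the number of keys grows (T = 1),
    # while a low temperature keeps the mass high but flattens the gradient.
    for n in (128, 1024, 8192, 131072):
        print(
            n,
            f"mass(T=1)={mass_on_relevant(n):.4f}",
            f"mass(T=0.25)={mass_on_relevant(n, temperature=0.25):.4f}",
            f"grad(T=0.25)={self_gradient(n, temperature=0.25):.2e}",
        )
```

In this toy setting, lowering the temperature restores focus on the relevant key, but the gradient through that key's logit becomes very small, which is the learnability problem the abstract says LUCID is designed to avoid.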

Related papers

FASA: Frequency-aware Sparse Attention [56.26881872333624]
We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. Across a spectrum of long-context tasks, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy.
arXiv Detail & Related papers (2026-02-03T06:09:06Z)
vAttention: Verified Sparse Attention [100.98210818821688]
vAttention is a practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy (thus, verified). We show that vAttention significantly improves the quality of sparse attention across datasets. It can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality.
arXiv Detail & Related papers (2025-10-07T08:46:08Z)
Long-Context Generalization with Sparse Attention [21.400056571592277]
Transformer-based architectures traditionally employ softmax to compute attention weights. As sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show that dynamically sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens.
arXiv Detail & Related papers (2025-06-19T22:43:25Z)
Lag-Relative Sparse Attention In Long Context Training [8.365610885641276]
We propose Lag-Relative Sparse Attention (LRSA), anchored by the LagKV compression method, for long-context post-training. Our method performs chunk-by-chunk prefilling, which selects the top K most relevant key-value pairs in a fixed-size lagging window.
arXiv Detail & Related papers (2025-06-13T06:49:53Z)
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z)
Linear-Time User-Level DP-SCO via Robust Statistics [55.350093142673316]
User-level differentially private stochastic convex optimization (DP-SCO) has garnered significant attention due to the importance of safeguarding user privacy in machine learning applications. Current methods, such as those based on differentially private stochastic gradient descent (DP-SGD), often struggle with high noise accumulation and suboptimal utility. We introduce a novel linear-time algorithm that leverages robust statistics, specifically the median and trimmed mean, to overcome these challenges.
arXiv Detail & Related papers (2025-02-13T02:05:45Z)
Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed. During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant. We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
Taming GANs with Lookahead-Minmax [63.90038365274479]
Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient. Using 30-fold fewer parameters and 16-fold smaller minibatches, we outperform the reported performance of the class-dependent BigGAN on CIFAR-10 by obtaining an FID of 12.19 without using the class labels.
arXiv Detail & Related papers (2020-06-25T17:13:23Z)
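
The "exact zeros" property mentioned in the Long-Context Generalization with Sparse Attention entry above can be illustrated with sparsemax, which is the $\alpha = 2$ special case of $\alpha$-entmax (softmax corresponds to $\alpha = 1$). The sketch below is a minimal illustration using the standard sort-and-threshold procedure; it is not taken from that paper, whose choice of $\alpha$ and implementation are not stated in this listing, and the example logits are assumed toy values.

```python
def sparsemax(logits):
    """Euclidean projection of the logits onto the probability simplex.

    Unlike softmax, entries that fall below the threshold tau receive a
    probability of exactly 0.0.
    """
    z_sorted = sorted(logits, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z in enumerate(z_sorted, start=1):
        cumsum += z
        # The k-th largest logit stays in the support while this holds.
        if 1.0 + k * z > cumsum:
            tau = (cumsum - 1.0) / k  # threshold for a support of size k
        else:
            break
    return [max(x - tau, 0.0) for x in logits]


if __name__ == "__main__":
    # Toy scores: two plausible keys and one clearly irrelevant key.
    probs = sparsemax([0.5, 0.4, -2.0])
    print(probs)       # the irrelevant key gets exactly zero weight
    print(sum(probs))  # the weights still sum to one
```

Because irrelevant keys are removed from the support entirely, adding more of them does not dilute the weight on the remaining keys, which is the behaviour that entry contrasts with softmax.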

This list is automatically generated from the titles and abstracts of the papers in this site.
