S$^3$-Attention:Attention-Aligned Endogenous Retrieval for Memory-Bounded Long-Context Inference
- URL: http://arxiv.org/abs/2601.17702v2
- Date: Wed, 28 Jan 2026 15:54:56 GMT
- Title: S$^3$-Attention:Attention-Aligned Endogenous Retrieval for Memory-Bounded Long-Context Inference
- Authors: Qingsen Ma, Dianyun Wang, Yaoye Wang, Lechen Ning, Sujie Zhu, Xiaohang Zhang, Jiaming Lyu, Linhao Ren, Zhenbo Xu, Zhaofeng He,
- Abstract summary: We present S3-Attention, a memory-first inference-time framework that treats long-context processing as attention-aligned endogenous retrieval.<n>S3-Attention decodes transient key and query projections into top-k sparse feature identifiers using lightweight sparse autoencoders.<n>It constructs a CPU-based inverted index mapping features to token positions or spans during a single streaming scan.
- Score: 11.779449360037518
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are increasingly applied to multi-document and long-form inputs, yet long-context inference remains memory- and noise-inefficient. Key-value (KV) caching scales linearly with context length, while external retrieval methods often return lexically similar but causally irrelevant passages. We present S3-Attention, a memory-first inference-time framework that treats long-context processing as attention-aligned endogenous retrieval. S3-Attention decodes transient key and query projections into top-k sparse feature identifiers using lightweight sparse autoencoders, and constructs a CPU-based inverted index mapping features to token positions or spans during a single streaming scan. This design allows the KV cache to be discarded entirely and bounds GPU memory usage by the scan chunk size. At generation time, feature co-activation is used to retrieve compact evidence spans, optionally fused with BM25 for exact lexical matching. Under a unified LongBench evaluation protocol with fixed prompting, decoding, and matched token budgets, S3-Hybrid closely matches full-context inference across multiple model families and improves robustness in several information-dense settings. We also report an engineering limitation of the current prototype, which incurs higher wall-clock latency than optimized full-KV baselines, motivating future kernel-level optimization.
Related papers
- Multi-Vector Index Compression in Any Modality [73.7330345057813]
Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos.<n>We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC)<n>AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation.
arXiv Detail & Related papers (2026-02-24T18:57:33Z) - SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression.<n>We propose a three-stage pipeline designed to maximize information density and token utilization.<n> Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z) - CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing [28.184704036272787]
Long contexts pose significant challenges for inference efficiency in large language models.<n>We propose CTKVR, a novel centroid-then-token KV retrieval scheme.<n>CTKVR achieves superior performance across multiple benchmarks with less than 1% accuracy degradation.
arXiv Detail & Related papers (2025-12-17T15:56:32Z) - Efficient Low Rank Attention for Long-Context Inference in Large Language Models [41.24530756499533]
Low Rank Query and Key attention (LRQK) is a framework that decomposes the full-precision query and key matrices into compact rank-(r) factors during the prefill stage.<n>By selecting only the top-(k) tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU- CPU cache with a hit-and-miss mechanism that transfers only missing full-precision KV pairs.
arXiv Detail & Related papers (2025-10-25T11:43:27Z) - Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache [67.47789629197857]
We propose a training-free framework that exploits the heterogeneous roles of transformer head dimensions.<n>By projecting the long-context-insensitive dimensions onto Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients.<n>We show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack.
arXiv Detail & Related papers (2025-06-13T15:35:54Z) - Learn from the Past: Fast Sparse Indexing for Large Language Model Decoding [7.142158555793151]
Large language models (LLMs) continue to support increasingly longer contexts.<n>Memory demand for key-value caches during decoding grows rapidly.<n>Sparse attention mechanisms alleviate this issue by computing attention weights only for selected key-value pairs.<n>Existing methods often treat each decoding step as an independent process.<n>We propose LFPS, an acceleration method that dynamically constructs sparse indexing candidates based on historical attention patterns.
arXiv Detail & Related papers (2025-05-30T02:35:59Z) - RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference [27.69137902678418]
RetroInfer is a novel system that exploits the inherent attention sparsity to accelerate long-context inference.<n>We show up to 4.5X speedup over full attention within GPU memory limits and up to 10.5X over sparse attention baselines when KV cache is extended to CPU memory.
arXiv Detail & Related papers (2025-05-05T18:01:17Z) - SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cachecentric perspective.<n>We provide an extensive KV cache-centric analysis of eight categories long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids.<n>Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n2) pre-filling perform robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z) - Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed.<n>During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant.<n>We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
arXiv Detail & Related papers (2024-11-14T18:54:19Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving [31.766738294505767]
CacheGen is a fast context-loading module for large language models.
Uses a custom tensor encoder to encode a KV cache into compact bitstream representations.
adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth.
arXiv Detail & Related papers (2023-10-11T07:08:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.