Related papers: S$^3$-Attention: Attention-Aligned Endogenous Retrieval for Memory-Bounded Long-Context Inference

URL: http://arxiv.org/abs/2601.17702v2
Date: Wed, 28 Jan 2026 15:54:56 GMT
Title: S$^3$-Attention: Attention-Aligned Endogenous Retrieval for Memory-Bounded Long-Context Inference
Authors: Qingsen Ma, Dianyun Wang, Yaoye Wang, Lechen Ning, Sujie Zhu, Xiaohang Zhang, Jiaming Lyu, Linhao Ren, Zhenbo Xu, Zhaofeng He
Abstract summary: We present S3-Attention, a memory-first inference-time framework that treats long-context processing as attention-aligned endogenous retrieval. S3-Attention decodes transient key and query projections into top-k sparse feature identifiers using lightweight sparse autoencoders. It constructs a CPU-based inverted index mapping features to token positions or spans during a single streaming scan.
Score: 11.779449360037518
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are increasingly applied to multi-document and long-form inputs, yet long-context inference remains memory- and noise-inefficient. Key-value (KV) caching scales linearly with context length, while external retrieval methods often return lexically similar but causally irrelevant passages. We present S3-Attention, a memory-first inference-time framework that treats long-context processing as attention-aligned endogenous retrieval. S3-Attention decodes transient key and query projections into top-k sparse feature identifiers using lightweight sparse autoencoders, and constructs a CPU-based inverted index mapping features to token positions or spans during a single streaming scan. This design allows the KV cache to be discarded entirely and bounds GPU memory usage by the scan chunk size. At generation time, feature co-activation is used to retrieve compact evidence spans, optionally fused with BM25 for exact lexical matching. Under a unified LongBench evaluation protocol with fixed prompting, decoding, and matched token budgets, S3-Hybrid closely matches full-context inference across multiple model families and improves robustness in several information-dense settings. We also report an engineering limitation of the current prototype, which incurs higher wall-clock latency than optimized full-KV baselines, motivating future kernel-level optimization.
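The indexing and retrieval steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`top_k_features`, `build_inverted_index`, `retrieve`) are hypothetical, the sparse-autoencoder activations are assumed to be supplied as a plain NumPy array with one row per token, co-activation is approximated by counting shared feature identifiers, and the optional BM25 fusion is omitted.

```python
from collections import defaultdict

import numpy as np


def top_k_features(activations, k):
    """Return the indices of the k largest activations in each row.

    `activations` stands in for sparse-autoencoder outputs (assumption);
    the order of indices within each row's top-k is not specified.
    """
    # argpartition moves the k largest entries of each row into the last k columns.
    return np.argpartition(activations, -k, axis=1)[:, -k:]


def build_inverted_index(feature_ids, offset=0, index=None):
    """Map each feature id to the token positions where it was active.

    `offset` is the absolute position of the first token in this chunk, so the
    same index can be extended chunk by chunk during a streaming scan.
    """
    if index is None:
        index = defaultdict(list)
    for row, feats in enumerate(feature_ids):
        for f in feats:
            index[int(f)].append(offset + row)
    return index


def retrieve(index, query_features, top_n):
    """Rank token positions by how many query features are also active there."""
    votes = defaultdict(int)
    for f in query_features:
        for pos in index.get(int(f), ()):
            votes[pos] += 1
    # Highest vote count first; ties are broken by earlier position.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [pos for pos, _ in ranked[:top_n]]
```

In this sketch only the current chunk's activations need to be held in memory while the index grows on the CPU side, which mirrors the abstract's claim that GPU memory is bounded by the scan chunk size. A caller would invoke `build_inverted_index` once per chunk with an increasing `offset`, then pass the query's top-k feature identifiers to `retrieve`.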

Related papers

Multi-Vector Index Compression in Any Modality [73.7330345057813]
Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation.
arXiv Detail & Related papers (2026-02-24T18:57:33Z)
SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z)
CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing [28.184704036272787]
Long contexts pose significant challenges for inference efficiency in large language models. We propose CTKVR, a novel centroid-then-token KV retrieval scheme. CTKVR achieves superior performance across multiple benchmarks with less than 1% accuracy degradation.
arXiv Detail & Related papers (2025-12-17T15:56:32Z)
Efficient Low Rank Attention for Long-Context Inference in Large Language Models [41.24530756499533]
Low Rank Query and Key attention (LRQK) is a framework that decomposes the full-precision query and key matrices into compact rank-r factors during the prefill stage. By selecting only the top-k tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism that transfers only missing full-precision KV pairs.
arXiv Detail & Related papers (2025-10-25T11:43:27Z)
Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache [67.47789629197857]
We propose a training-free framework that exploits the heterogeneous roles of transformer head dimensions. By projecting the long-context-insensitive dimensions onto Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. We show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack.
arXiv Detail & Related papers (2025-06-13T15:35:54Z)
Learn from the Past: Fast Sparse Indexing for Large Language Model Decoding [7.142158555793151]
Large language models (LLMs) continue to support increasingly longer contexts. Memory demand for key-value caches during decoding grows rapidly. Sparse attention mechanisms alleviate this issue by computing attention weights only for selected key-value pairs. Existing methods often treat each decoding step as an independent process. We propose LFPS, an acceleration method that dynamically constructs sparse indexing candidates based on historical attention patterns.
arXiv Detail & Related papers (2025-05-30T02:35:59Z)
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference [27.69137902678418]
RetroInfer is a novel system that exploits the inherent attention sparsity to accelerate long-context inference. We show up to 4.5X speedup over full attention within GPU memory limits and up to 10.5X over sparse attention baselines when KV cache is extended to CPU memory.
arXiv Detail & Related papers (2025-05-05T18:01:17Z)
SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cache-centric perspective. We provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling performs robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z)
Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed. During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant. We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving [31.766738294505767]
CacheGen is a fast context-loading module for large language models. It uses a custom tensor encoder to encode a KV cache into compact bitstream representations. It adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth.
arXiv Detail & Related papers (2023-10-11T07:08:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.