SOCKET: SOft Collision Kernel EsTimator for Sparse Attention
- URL: http://arxiv.org/abs/2602.06283v1
- Date: Fri, 06 Feb 2026 00:41:44 GMT
- Title: SOCKET: SOft Collision Kernel EsTimator for Sparse Attention
- Authors: Sahil Joshi, Agniva Chowdhury, Wyatt Bellinger, Amar Kanakamedala, Ekam Singh, Hoang Anh Duy Le, Aditya Desai, Anshumali Shrivastava
- Abstract summary: Exploiting sparsity during long-context inference is central to scaling large language models. We revisit Locality-Sensitive Hashing (LSH) as a sparsification primitive and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation.
- Score: 25.278711498381494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exploiting sparsity during long-context inference is central to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends critically on efficient scoring and selection of relevant tokens at inference time. We revisit Locality-Sensitive Hashing (LSH) as a sparsification primitive and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation. Our key insight is that hard LSH produces discrete collision signals and is therefore poorly suited for ranking. In contrast, soft LSH aggregates graded collision evidence across hash tables, preserving the stability of relative ordering among the true top-$k$ tokens. This transformation elevates LSH from a candidate-generation heuristic to a principled and mathematically grounded scoring kernel for sparse attention. Leveraging this property, SOCKET enables efficient token selection without ad-hoc voting mechanisms, and matches or surpasses established sparse attention baselines across multiple long-context benchmarks using a diverse set of models. With a custom CUDA kernel for scoring keys and a Flash Decode Triton backend for sparse attention, SOCKET achieves up to 1.5$\times$ higher throughput than FlashAttention, making it an effective tool for long-context inference. Code is open-sourced at https://github.com/amarka8/SOCKET.
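To make the hard-versus-soft distinction concrete, here is a minimal NumPy sketch built on SimHash (random-hyperplane) codes. Everything in it, from the function names to the choice of per-table bit-match rate as the graded collision signal, is an illustrative assumption for this summary, not SOCKET's released CUDA implementation.

```python
import numpy as np

def simhash_bits(x, planes):
    """SimHash codes of x under L tables of m random hyperplanes each.

    planes: (L, m, d) Gaussian projections; returns (..., L, m) bits in {0, 1}.
    """
    return (np.einsum("lmd,...d->...lm", planes, x) > 0).astype(np.int8)

def collision_scores(query, keys, planes, soft=True):
    """Score keys against a query across L hash tables.

    Hard scoring counts tables whose full m-bit codes collide, a discrete
    signal with many ties. Soft scoring sums the per-table fraction of
    matching bits, graded evidence that ranks keys far more stably.
    """
    q = simhash_bits(query, planes)      # (L, m)
    k = simhash_bits(keys, planes)       # (n, L, m)
    match = (k == q).mean(axis=2)        # (n, L) per-table bit-match rate
    if soft:
        return match.sum(axis=1)         # graded collision evidence per key
    return (match == 1.0).sum(axis=1)    # exact-collision count per key

# Toy usage: a query near key 3 should rank it inside the soft top-k.
rng = np.random.default_rng(0)
d, L, m, n = 64, 8, 4, 256
planes = rng.standard_normal((L, m, d))
keys = rng.standard_normal((n, d))
query = keys[3] + 0.1 * rng.standard_normal(d)
topk = np.argsort(-collision_scores(query, keys, planes))[:16]
print(3 in topk)  # almost surely True: soft scores break hard-LSH ties
```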
Related papers
- Time Is All It Takes: Spike-Retiming Attacks on Event-Driven Spiking Neural Networks [87.16809558673403]
Spiking neural networks (SNNs) compute with discrete spikes and exploit temporal structure. We study a timing-only adversary that retimes existing spikes while preserving spike counts and amplitudes in event-driven SNNs.
arXiv Detail & Related papers (2026-02-03T09:06:53Z)
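As a toy illustration of that timing-only constraint, the sketch below applies random jitter (a real attack would optimize the shifts) while provably preserving per-neuron spike counts; all names here are invented for the example.

```python
import numpy as np

def retime_spikes(events, jitter_ms, rng):
    """Perturb only the timestamps of an event stream.

    events: rows of (t_ms, neuron_id). The output keeps exactly the same
    neuron ids, hence the same per-neuron spike counts and amplitudes,
    and only shifts each spike time by up to +/- jitter_ms.
    """
    t = events[:, 0].astype(float)
    shift = rng.uniform(-jitter_ms, jitter_ms, size=t.shape)
    retimed = np.column_stack([np.clip(t + shift, 0, None), events[:, 1]])
    return retimed[np.argsort(retimed[:, 0])]  # keep the stream time-ordered

rng = np.random.default_rng(0)
events = np.array([[1.0, 0], [2.5, 1], [2.5, 0], [7.0, 2]])
adv = retime_spikes(events, jitter_ms=1.0, rng=rng)
# Per-neuron counts are unchanged even though every timestamp moved.
assert np.array_equal(np.bincount(events[:, 1].astype(int)),
                      np.bincount(adv[:, 1].astype(int)))
```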
- FOCUS: DLLMs Know How to Tame Their Compute Bound [10.298643186738799]
FOCUS is an inference system designed for Diffusion Large Language Models (DLLMs). It focuses computation on decodable tokens and evicts non-decodable ones on the fly. It achieves up to 3.52$\times$ throughput improvement over the production-grade engine LMM.
arXiv Detail & Related papers (2026-01-30T18:52:06Z)
- Sparser Block-Sparse Attention via Token Permutation [46.22204775916057]
We propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling.
arXiv Detail & Related papers (2025-10-24T09:11:50Z)
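To see why permuting tokens raises block-level sparsity, consider the NumPy sketch below: grouping keys by a crude cluster assignment concentrates a top-p attention mask into fewer key blocks. The grouping heuristic and all names are assumptions for illustration, not the PBS-Attn algorithm or its kernels.

```python
import numpy as np

def block_density(mask, B):
    """Fraction of (query, key-block) pairs containing any unmasked entry."""
    nq, nk = mask.shape
    return mask.reshape(nq, nk // B, B).any(axis=2).mean()

rng = np.random.default_rng(0)
n, d, B = 256, 32, 16
# Keys drawn from a few clusters but stored in shuffled order, as in a KV cache.
centers = 3.0 * rng.standard_normal((4, d))
keys = centers[rng.integers(0, 4, n)] + rng.standard_normal((n, d))
queries = centers[rng.integers(0, 4, n)] + rng.standard_normal((n, d))

scores = queries @ keys.T
mask = scores > np.quantile(scores, 0.95)  # keep roughly the top 5% of entries

# Permute keys so that similar keys become adjacent; grouping by the most
# similar center stands in for the paper's permutation strategy.
perm = np.argsort(np.argmax(keys @ centers.T, axis=1), kind="stable")
print(block_density(mask, B), block_density(mask[:, perm], B))
# The permuted mask touches fewer key blocks, i.e., it is block-sparser.
```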
- Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval [67.21678698740267]
We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys. We also develop a lightweight, stable training framework using a Bradley-Terry ranking-based loss.
arXiv Detail & Related papers (2025-08-27T10:11:27Z)
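The Bradley-Terry ranking loss mentioned in that summary is a standard pairwise objective; the minimal sketch below is generic and is not Spotlight Attention's training code (the toy scores and names are invented).

```python
import numpy as np

def bradley_terry_loss(s_pos, s_neg):
    """Pairwise ranking loss -log sigmoid(s_pos - s_neg), computed stably.

    Pushes the score of a key that should be retrieved (s_pos) above the
    score of one that should not (s_neg), yielding smooth gradients where
    a hard 0/1 retrieval target would give none.
    """
    return np.logaddexp(0.0, -(s_pos - s_neg)).mean()

# Toy check: well-ordered scores give a small loss, swapped scores a large one.
rng = np.random.default_rng(0)
s_pos = rng.normal(1.0, 0.5, 128)
s_neg = rng.normal(0.0, 0.5, 128)
print(bradley_terry_loss(s_pos, s_neg), bradley_terry_loss(s_neg, s_pos))
```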
- Exploiting Discriminative Codebook Prior for Autoregressive Image Generation [54.14166700058777]
Token-based autoregressive image generation systems first tokenize images into sequences of token indices with a codebook, and then model these sequences in an autoregressive paradigm. While autoregressive generative models are trained only on index values, the prior encoded in the codebook, which contains rich token similarity information, is not exploited. Recent studies have attempted to incorporate this prior by performing naive k-means clustering on the tokens, helping to facilitate the training of generative models with a reduced codebook. We propose the Discriminative Codebook Prior Extractor (DCPE) as an alternative to k-means.
arXiv Detail & Related papers (2025-08-14T15:00:00Z)
- Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. LRMs need to generate long chain-of-thought reasoning in order to think before answering. We introduce Multipole Attention, which accelerates autoregressive reasoning by computing exact attention only for the most important tokens.
arXiv Detail & Related papers (2025-06-16T03:00:40Z)
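In the multipole spirit, exact attention for a few important tokens plus coarse summaries of the rest, here is a loose NumPy analogy. The round-robin clustering and size-weighted centroid logits are assumptions of this sketch, not the paper's algorithm.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multipole_like_attention(q, K, V, important, n_clusters=8):
    """Exact attention over `important` keys; centroid summaries for the rest.

    Remaining keys are pooled into centroids weighted by cluster size, a
    far-field approximation. Real clustering (e.g. k-means) would replace
    the round-robin labels used here.
    """
    rest = np.setdiff1d(np.arange(len(K)), important)
    labels = np.arange(len(rest)) % n_clusters  # stand-in for k-means
    cK = np.stack([K[rest][labels == c].mean(0) for c in range(n_clusters)])
    cV = np.stack([V[rest][labels == c].mean(0) for c in range(n_clusters)])
    sizes = np.bincount(labels, minlength=n_clusters)
    # Each centroid logit represents sizes[c] keys, hence the log weight.
    logits = np.concatenate([q @ K[important].T, q @ cK.T + np.log(sizes)])
    w = softmax(logits)
    return w[:len(important)] @ V[important] + w[len(important):] @ cV

rng = np.random.default_rng(0)
d, n = 32, 512
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))
q = rng.standard_normal(d)
important = np.argsort(-(q @ K.T))[:32]  # tokens with the largest logits
err = np.linalg.norm(multipole_like_attention(q, K, V, important)
                     - softmax(q @ K.T) @ V)
print(err)  # small, because the high-weight tokens are treated exactly
```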
- Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed. During inference, we compare query tokens from the user input with centroids of the clustered fixed-context keys to predict which keys from the fixed context are semantically relevant. We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
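The centroid-lookup step that summary describes fits in a few lines; in the sketch below, the function names and the round-robin stand-in for offline k-means are assumptions, not the paper's implementation, and the hierarchical (logarithmic) variant is not shown.

```python
import numpy as np

def predict_relevant_keys(query, centroids, members, top_c=2):
    """Pick fixed-context keys by comparing the query to cluster centroids.

    centroids: (C, d) centers of the fixed-context keys; members: list of
    C index arrays into the key cache. Only keys in the top_c best-scoring
    clusters are returned, so exact attention touches a fraction of the
    fixed context.
    """
    best = np.argsort(-(centroids @ query))[:top_c]
    return np.concatenate([members[c] for c in best])

# Offline: cluster the fixed-context keys (round-robin labels stand in for
# k-means). Online: route each decoded query through the centroids.
rng = np.random.default_rng(0)
n, d, C = 1024, 64, 16
keys = rng.standard_normal((n, d))
labels = np.arange(n) % C
members = [np.flatnonzero(labels == c) for c in range(C)]
centroids = np.stack([keys[m].mean(0) for m in members])
idx = predict_relevant_keys(rng.standard_normal(d), centroids, members)
print(len(idx), "of", n, "keys kept for exact attention")
```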
- Local Sample-weighted Multiple Kernel Clustering with Consensus Discriminative Graph [73.68184322526338]
Multiple kernel clustering (MKC) aims to achieve optimal information fusion from a set of base kernels.
This paper proposes a novel local sample-weighted multiple kernel clustering model.
Experimental results demonstrate that our LSWMKC possesses better local manifold representation and outperforms existing kernel- or graph-based clustering algorithms.
arXiv Detail & Related papers (2022-07-05T05:00:38Z)
- Kernel k-Means, By All Means: Algorithms and Strong Consistency [21.013169939337583]
Kernel $k$-means clustering is a powerful tool for unsupervised learning of non-linear data.
In this paper, we generalize results leveraging a general family of means to combat sub-optimal local solutions.
Our algorithm makes use of majorization-minimization (MM) to better solve this non-linear separation problem.
arXiv Detail & Related papers (2020-11-12T16:07:18Z)
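For reference, the plain Lloyd-style kernel k-means that such work builds on can be written against a precomputed Gram matrix, as in the sketch below; this is the textbook baseline, not the paper's MM algorithm, which generalizes the cluster mean to a broader family of means.

```python
import numpy as np

def kernel_kmeans(K, k, iters=50, rng=None):
    """Lloyd-style kernel k-means on a precomputed Gram matrix K.

    Distances to implicit cluster means use only kernel values:
    ||phi(x_i) - mu_c||^2 = K_ii - 2*mean_j K_ij + mean_{j,l} K_jl,
    with j, l ranging over the members of cluster c.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(K)
    labels = rng.integers(0, k, n)
    for _ in range(iters):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            if len(idx):
                dist[:, c] = (np.diag(K) - 2 * K[:, idx].mean(1)
                              + K[np.ix_(idx, idx)].mean())
        new = dist.argmin(1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

# Toy usage: an RBF kernel separates two Gaussian blobs non-linearly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)
print(np.bincount(kernel_kmeans(K, 2)))  # roughly a 50/50 split
```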