Fugu-MT Paper Translation (Summary): SOCKET: SOft Collison Kernel EsTimator for Sparse Attention

Paper overview: SOCKET: SOft Collison Kernel EsTimator for Sparse Attention

arxiv url: http://arxiv.org/abs/2602.06283v1
Date: Fri, 06 Feb 2026 00:41:44 GMT
Status: Translation complete
Last updated in system: 2026-02-09 22:18:26.168139
Title: SOCKET: SOft Collison Kernel EsTimator for Sparse Attention
Title (reference translation): SOCKET: Soft Collison Kernel EsTimator for Sparse Attention
Authors: Sahil Joshi, Agniva Chowdhury, Wyatt Bellinger, Amar Kanakamedala, Ekam Singh, Hoang Anh Duy Le, Aditya Desai, Anshumali Shrivastava
Abstract summary: Exploiting sparsity during long-context inference is central to scaling large language models. The paper revisits Locality-Sensitive Hashing (LSH) as a sparsification primitive. SOCKET is a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation.
Reference score (attention score computed independently by this site): 25.278711498381494
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Exploiting sparsity during long-context inference is central to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends critically on efficient scoring and selection of relevant tokens at inference time. We revisit Locality-Sensitive Hashing (LSH) as a sparsification primitive and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation. Our key insight is that hard LSH produces discrete collision signals and is therefore poorly suited for ranking. In contrast, soft LSH aggregates graded collision evidence across hash tables, preserving the stability of relative ordering among the true top-$k$ tokens. This transformation elevates LSH from a candidate-generation heuristic to a principled and mathematically grounded scoring kernel for sparse attention. Leveraging this property, SOCKET enables efficient token selection without ad-hoc voting mechanisms, and matches or surpasses established sparse attention baselines across multiple long-context benchmarks using a diverse set of models. With a custom CUDA kernel for scoring keys and a Flash Decode Triton backend for sparse attention, SOCKET achieves up to 1.5$\times$ higher throughput than FlashAttention, making it an effective tool for long-context inference. Code is open-sourced at https://github.com/amarka8/SOCKET.
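Abstract (reference translation): Exploiting sparsity in long-context inference is central to scaling large language models, because attention dominates the cost of autoregressive decoding. Sparse attention lowers this cost by limiting computation to a subset of tokens, but how well it works depends heavily on efficient scoring and selection of the relevant tokens at inference time. The authors revisit Locality-Sensitive Hashing (LSH) as a sparsification primitive and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation. Their key insight is that hard LSH produces discrete collision signals and is therefore poorly suited to ranking. Soft LSH, by contrast, aggregates graded collision evidence across hash tables and preserves the stability of the relative ordering among the true top-$k$ tokens. This transformation elevates LSH from a candidate-generation heuristic to a principled, mathematically grounded scoring kernel for sparse attention. Building on this property, SOCKET enables efficient token selection without ad-hoc voting mechanisms, and it matches or surpasses established sparse attention baselines on multiple long-context benchmarks across a diverse set of models. With a custom CUDA kernel for scoring keys and a Flash Decode Triton backend for sparse attention, SOCKET achieves up to 1.5$\times$ higher throughput than FlashAttention. The code is available at https://github.com/amarka8/SOCKET.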
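The contrast the abstract draws between hard and soft LSH scoring can be illustrated with a minimal sketch. It is not the paper's kernel: it assumes sign-random-projection (SimHash) hashing, a sigmoid relaxation of each hash bit, and a hypothetical `temperature` parameter, none of which are stated in the abstract. All function names are illustrative.

```python
import math
import random


def make_tables(dim, n_tables, n_bits, seed=0):
    """Draw n_tables sets of n_bits random Gaussian hyperplanes (assumed SimHash)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
        for _ in range(n_tables)
    ]


def project(planes, x):
    """Signed projection of x onto each hyperplane of one table."""
    return [sum(p_i * x_i for p_i, x_i in zip(p, x)) for p in planes]


def sigmoid(z):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))


def hard_score(tables, q, k):
    """Hard LSH: count tables where query and key land in the same bucket."""
    score = 0
    for planes in tables:
        q_bits = [v >= 0 for v in project(planes, q)]
        k_bits = [v >= 0 for v in project(planes, k)]
        score += int(q_bits == k_bits)
    return score  # an integer in [0, n_tables], so many keys tie


def soft_score(tables, q, k, temperature=1.0):
    """Soft variant (assumption): per table, the product over bits of a
    sigmoid 'probability' that the query falls on the key's side of each
    hyperplane, summed over tables to give a graded score."""
    total = 0.0
    for planes in tables:
        q_proj = project(planes, q)
        k_sign = [1.0 if v >= 0 else -1.0 for v in project(planes, k)]
        prob = 1.0
        for qv, s in zip(q_proj, k_sign):
            prob *= sigmoid(s * qv / temperature)
        total += prob
    return total


def top_k(tables, q, keys, k, score_fn):
    """Indices of the k highest-scoring keys (ties keep the lower index)."""
    scores = [score_fn(tables, q, key) for key in keys]
    return sorted(range(len(keys)), key=lambda i: -scores[i])[:k]
```

In this sketch the hard score can only take `n_tables + 1` distinct values, so many keys tie and their relative order is arbitrary, whereas the soft score varies continuously with how far the query lies from each hyperplane. That is one way to picture why graded collision evidence is better suited to ranking than bucket matches.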

Related papers