Fugu-MT Paper Translation (Abstract): Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference

Paper overview: Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference

arxiv url: http://arxiv.org/abs/2510.18413v1
Date: Tue, 21 Oct 2025 08:44:47 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:13.159338
Title: Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference
Authors: Siyuan Yan, Guo-Qing Jiang, Yuchen Zhang, Xiaoxing Ma, Ran Zhu, Chun Cao, Jingwei Xu
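Abstract summary: We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods.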
Reference score (independently computed measure of attention): 15.466168180222164
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selections. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.
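The pipeline named in the abstract (Hadamard transform, bucketization, 2-bit compression, Manhattan-distance estimation for top-k selection) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the bucket thresholds, the unnormalized transform and the tie-breaking rule are all hypothetical choices made here. The abstract does not specify any of them.

```python
def hadamard_transform(vec):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def bucketize_2bit(vec, thresholds=(-1.0, 0.0, 1.0)):
    """Map each value to a bucket index in {0, 1, 2, 3}, i.e. a 2-bit code.

    The three thresholds are an illustrative assumption, not values from the paper.
    """
    lo, mid, hi = thresholds
    return [int(x > lo) + int(x > mid) + int(x > hi) for x in vec]


def manhattan(a, b):
    """L1 distance between two equal-length code vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_top_k(query, keys, k):
    """Return the indices of the k keys whose 2-bit codes are closest to the query's.

    Ties are broken by lower index (an assumption made for determinism).
    """
    q_code = bucketize_2bit(hadamard_transform(query))
    dists = [manhattan(q_code, bucketize_2bit(hadamard_transform(key)))
             for key in keys]
    order = sorted(range(len(keys)), key=lambda i: (dists[i], i))
    return sorted(order[:k])


query = [1.0, 2.0, -1.0, 0.5]
keys = [
    [-1.0, -2.0, 1.0, -0.5],  # opposite direction to the query
    [1.0, 2.0, -1.0, 0.5],    # identical to the query
    [0.0, 0.0, 0.0, 0.0],     # neutral
]
print(select_top_k(query, keys, 2))  # prints [1, 2]
```

Full attention would then be computed only over the selected keys and their values. In a real decoder the key codes would presumably be computed once and stored next to the KV cache rather than recomputed for every query. That caching scheme is also an assumption of this sketch.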

