Fugu-MT Paper Translation (Abstract): Multipole Attention for Efficient Long Context Reasoning

Paper overview: Multipole Attention for Efficient Long Context Reasoning

arxiv url: http://arxiv.org/abs/2506.13059v1
Date: Mon, 16 Jun 2025 03:00:40 GMT
Status: Translation complete
Last updated in system: 2025-06-17 17:28:47.389941
Title: Multipole Attention for Efficient Long Context Reasoning
Title (reference translation): Multipole Attention for Efficient Long Context Reasoning
Authors: Coleman Hooper, Sebastian Zhao, Luca Manolache, Sehoon Kim, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
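Abstract summary: Large Reasoning Models (LRMs) show promising accuracy improvements on complex problem-solving tasks. LRMs need to generate long chain-of-thought reasoning in order to think before answering. This paper proposes Multipole Attention, which accelerates autoregressive reasoning by computing exact attention only for the most important tokens.
Reference score (independently computed attention metric): 64.94673641704289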
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. While these models have attained high accuracy by leveraging additional computation at test time, they need to generate long chain-of-thought reasoning in order to think before answering, which requires generating thousands of tokens. While sparse attention methods can help reduce the KV cache pressure induced by this long autoregressive reasoning, these methods can introduce errors which disrupt the reasoning process. Additionally, prior methods often pre-process the input to make it easier to identify the important prompt tokens when computing attention during generation, and this pre-processing is challenging to perform online for newly generated reasoning tokens. Our work addresses these challenges by introducing Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens, while maintaining approximate representations for the remaining tokens. Our method first performs clustering to group together semantically similar key vectors, and then uses the cluster centroids both to identify important key vectors and to approximate the remaining key vectors in order to retain high accuracy. We design a fast cluster update process to quickly re-cluster the input and previously generated tokens, thereby allowing for accelerating attention to the previous output tokens. We evaluate our method using emerging LRMs such as Qwen-8B, demonstrating that our approach can maintain accuracy on complex reasoning tasks even with aggressive attention sparsity settings. We also provide kernel implementations to demonstrate the practical efficiency gains from our method, achieving up to 4.5$\times$ speedup for attention in long-context reasoning applications. Our code is available at https://github.com/SqueezeAILab/MultipoleAttention.
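Abstract (reference translation): Large Reasoning Models (LRMs) show promising accuracy improvements on complex problem-solving tasks. These models achieve high accuracy by leveraging additional computation at test time, but because they must generate long chain-of-thought reasoning in order to think before answering, they need to generate thousands of tokens. Sparse attention methods can reduce the KV cache pressure caused by this long autoregressive reasoning, but they can introduce errors that disrupt the reasoning process. In addition, prior methods often pre-process the input to make important prompt tokens easier to identify during generation, and this pre-processing is difficult to perform online for newly generated reasoning tokens. This work addresses these challenges by introducing Multipole Attention, which accelerates autoregressive reasoning by computing exact attention only for the most important tokens while maintaining approximate representations of the remaining tokens. The method first clusters semantically similar key vectors, then uses the cluster centroids both to identify important key vectors and to approximate the remaining key vectors, which preserves high accuracy. A fast cluster update process quickly re-clusters the input and previously generated tokens, which makes it possible to accelerate attention to previous output tokens as well. The method is evaluated on emerging LRMs such as Qwen-8B and maintains accuracy on complex reasoning tasks even under aggressive attention sparsity settings. Kernel implementations are also provided to demonstrate the practical efficiency gains, achieving up to 4.5$\times$ speedup for attention in long-context reasoning applications. The code is available at https://github.com/SqueezeAILab/MultipoleAttention.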
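
The idea described in the abstract (cluster the key vectors, attend exactly to tokens in the highest-scoring clusters, and stand in for the remaining tokens with their cluster centroids) can be illustrated with a small NumPy sketch. This is not the authors' implementation: plain k-means, the function names `cluster_keys` and `multipole_attention_sketch`, the `top_clusters` parameter, and the choice to approximate each non-selected cluster by its key centroid together with the sum of its value vectors are all assumptions made for illustration. The fast cluster update and the GPU kernels mentioned in the abstract are not modeled here.

```python
import numpy as np


def exact_attention(q, keys, values):
    """Reference softmax attention for a single query vector."""
    logits = keys @ q / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max())
    return weights @ values / weights.sum()


def cluster_keys(keys, num_clusters, iters=10, seed=0):
    """Plain k-means over key vectors (an assumed stand-in for the paper's clustering)."""
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), num_clusters, replace=False)].copy()
    for _ in range(iters):
        dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(num_clusters):
            members = keys[labels == c]
            if len(members):  # empty clusters keep their previous centroid
                centroids[c] = members.mean(axis=0)
    return centroids, labels


def multipole_attention_sketch(q, keys, values, centroids, labels, top_clusters):
    """Exact attention inside the top-scoring clusters, centroid approximation elsewhere."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    num_clusters = len(centroids)
    counts = np.bincount(labels, minlength=num_clusters)

    # Score each cluster by its centroid and pick the most relevant ones.
    centroid_logits = centroids @ q * scale
    important = np.argsort(centroid_logits)[::-1][:top_clusters]

    # Tokens in the selected clusters get exact logits.
    exact_mask = np.isin(labels, important)
    exact_logits = keys[exact_mask] @ q * scale

    # Every other non-empty cluster is represented by its centroid logit.
    rest = np.setdiff1d(np.arange(num_clusters), important)
    rest = rest[counts[rest] > 0]
    approx_logits = centroid_logits[rest]

    # Per-cluster sums of value vectors (assumed approximation of the values).
    value_sums = np.zeros((num_clusters, values.shape[1]))
    np.add.at(value_sums, labels, values)

    # Shared max subtraction keeps the softmax numerically stable.
    m = np.concatenate([exact_logits, approx_logits]).max()
    w_exact = np.exp(exact_logits - m)
    w_approx = np.exp(approx_logits - m)

    numerator = w_exact @ values[exact_mask] + w_approx @ value_sums[rest]
    denominator = w_exact.sum() + (w_approx * counts[rest]).sum()
    return numerator / denominator


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(256, 32))
    values = rng.normal(size=(256, 32))
    q = rng.normal(size=32)
    centroids, labels = cluster_keys(keys, num_clusters=16)
    approx = multipole_attention_sketch(q, keys, values, centroids, labels, top_clusters=4)
    print("max abs error vs exact:", np.abs(approx - exact_attention(q, keys, values)).max())
```

In this sketch, setting `top_clusters` to the total number of clusters recovers exact attention, while smaller values trade accuracy for fewer exact key lookups. How large the error is in practice depends on the data and the clustering, so the printed error above is only a toy measurement.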

Related papers