Fugu-MT Paper Translation (Abstract): Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

Paper abstract: Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

arxiv url: http://arxiv.org/abs/2511.09596v1
Date: Fri, 14 Nov 2025 01:01:04 GMT
Status: Translation complete
Last updated in system: 2025-11-14 22:53:22.359754
Title: Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off
Title (reference translation): Sparse Attention Without the Speed-Performance Trade-off
Authors: Mingkuan Zhao, Wentao Hu, Jiayin Wang, Xin Lai, Tianchen Huang, Yuheng Min, Rui Yan, Xiaoyan Zhu
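Abstract summary: Existing sparse methods often trade information integrity for computational efficiency. We propose SPAttention, whose core contribution is a new paradigm called Principled Structural Sparsity. SPAttention reorganizes the total attention workload into balanced, non-overlapping distance bands and assigns each head a unique segment.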
Reference score (independently computed attention score): 20.259111403684006
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of $O(H \cdot N^2)$ that grows quadratically with the context size ($N$) and linearly with the number of heads ($H$). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is the introduction of a new paradigm we term Principled Structural Sparsity. SPAttention does not merely drop connections but instead reorganizes the computational task by partitioning the total attention workload into balanced, non-overlapping distance bands, assigning each head a unique segment. This approach transforms the multi-head attention mechanism from $H$ independent $O(N^2)$ computations into a single, collaborative $O(N^2)$ computation, fundamentally reducing complexity by a factor of $H$. The structured inductive bias compels functional specialization among heads, enabling a more efficient allocation of computational resources from redundant modeling to distinct dependencies across the entire sequence span. Extensive empirical validation on the OLMoE-1B-7B and 0.25B-1.75B model series demonstrates that while delivering an approximately two-fold increase in training throughput, its performance is on par with standard dense attention, even surpassing it on select key metrics, while consistently outperforming representative sparse attention methods including Longformer, Reformer, and BigBird across all evaluation metrics.
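Abstract (reference translation): LLM design has long been constrained by a conflict in the core attention mechanism: its remarkable expressivity rests on a computational complexity of $O(H \cdot N^2)$, which grows quadratically with the context size ($N$) and linearly with the number of heads ($H$). The standard implementation carries substantial computational redundancy, because every head independently computes attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is a new paradigm called Principled Structural Sparsity. Rather than simply dropping connections, SPAttention reorganizes the computation by partitioning the total attention workload into balanced, non-overlapping distance bands and assigning each head a unique segment. This turns multi-head attention from $H$ independent $O(N^2)$ computations into a single collaborative $O(N^2)$ computation, reducing complexity by a factor of $H$. The structured inductive bias compels functional specialization among heads, shifting computational resources from redundant modeling to distinct dependencies across the entire sequence span. Extensive empirical validation on the OLMoE-1B-7B and 0.25B-1.75B model series shows an approximately two-fold increase in training throughput, with performance on par with standard dense attention and better on select key metrics, while consistently outperforming representative sparse attention methods such as Longformer, Reformer, and BigBird on all evaluation metrics.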
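The abstract does not spell out how the distance bands are chosen, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes causal attention, a sequence length `n` that is large relative to the head count, and a hypothetical balancing rule that gives each head a contiguous range of query-key distances holding roughly the same number of (query, key) pairs. The names `band_edges` and `band_masks` are illustrative.

```python
import numpy as np


def band_edges(n, num_heads):
    """Cut the causal distances 0..n-1 into num_heads contiguous bands.

    Distance d occurs for n - d (query, key) pairs, so the cuts are placed
    on the cumulative pair count to keep the per-head workload roughly even.
    """
    pairs_per_distance = n - np.arange(n)
    cumulative = np.cumsum(pairs_per_distance)
    targets = cumulative[-1] * np.arange(1, num_heads) / num_heads
    cuts = np.searchsorted(cumulative, targets)
    return np.concatenate(([0], cuts, [n]))


def band_masks(n, num_heads):
    """Return a (num_heads, n, n) boolean array; head h attends where masks[h] is True."""
    edges = band_edges(n, num_heads)
    query = np.arange(n)[:, None]
    key = np.arange(n)[None, :]
    distance = query - key  # negative for future keys, which no band includes
    return np.stack(
        [(distance >= lo) & (distance < hi) for lo, hi in zip(edges[:-1], edges[1:])]
    )


masks = band_masks(8, 4)
# Every causal (query, key) pair is covered by exactly one head.
covered_once = np.array_equal(masks.sum(axis=0), np.tril(np.ones((8, 8), dtype=int)))
pairs_per_head = masks.sum(axis=(1, 2))
print(covered_once, pairs_per_head)
```

Because the masks partition the causal triangle, the heads together touch each of the $N(N+1)/2$ causal pairs once, instead of each head touching all of them. That is the sense in which the total cost drops from $H$ separate $O(N^2)$ computations to one shared $O(N^2)$ computation.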


Related papers