Fugu-MT paper translation (abstract): BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers

Paper overview: BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers

arxiv url: http://arxiv.org/abs/2603.09582v1
Date: Tue, 10 Mar 2026 12:31:54 GMT
Status: translation complete
Last updated in system: 2026-03-11 15:25:24.304129
Title: BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers
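Title (reference translation): Binary Attention: 1-Bit QK-Attention for Vision and Diffusion Transformers
Authors: Chaodong Xiao, Zhengqiang Zhang, Lei Zhang
Abstract summary: The paper shows that binarizing attention preserves the essential similarity relationships and proposes BinaryAttention. A learnable bias mitigates the inherent information loss under 1-bit quantization and enables end-to-end acceleration. The work pushes the frontier of low-bit vision and diffusion transformers and provides a highly efficient and effective alternative to full-precision attention.
Reference score (independently computed attention score): 13.600791786470841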
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformers have achieved widespread and remarkable success, but the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly employ 8-bit or 4-bit quantization to balance efficiency and accuracy. In this paper, with theoretical justification, we show that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit qk-attention. Specifically, we retain only the sign of queries and keys in computing the attention, and replace the floating-point dot products with bit-wise operations, significantly reducing the computational cost. We mitigate the inherent information loss under 1-bit quantization by incorporating a learnable bias, and enable end-to-end acceleration. To maintain the accuracy of attention, we adopt quantization-aware training and self-distillation techniques, mitigating quantization errors while ensuring sign-aligned similarity. BinaryAttention is more than 2x faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks demonstrate that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. Our work provides a highly efficient and effective alternative to full-precision attention, pushing the frontier of low-bit vision and diffusion transformers. The code and models can be found at https://github.com/EdwardChasel/BinaryAttention.
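Abstract (reference translation): Transformers have been widely successful, but the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly use 8-bit or 4-bit quantization to balance efficiency and accuracy. This paper shows, with theoretical justification, that binarizing attention preserves the essential similarity relationships, and proposes BinaryAttention, a fast and accurate 1-bit qk-attention method. Specifically, only the signs of the queries and keys are kept when computing attention, and the floating-point dot products are replaced with bit-wise operations, which greatly reduces the computational cost. A learnable bias mitigates the inherent information loss under 1-bit quantization and enables end-to-end acceleration. To maintain the accuracy of attention, the authors adopt quantization-aware training and self-distillation, reducing quantization error while ensuring sign-aligned similarity. BinaryAttention is more than 2x faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks show that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. The work pushes the frontier of low-bit vision and diffusion transformers and provides a highly efficient and effective alternative to full-precision attention. The code and models are available at https://github.com/EdwardChasel/BinaryAttention.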
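
The core idea in the abstract (keep only the signs of queries and keys, and replace the dot product with bit-wise operations) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the 1/sqrt(d) scaling, the treatment of zero as positive, and the shape of the additive bias (one value per query-key pair) are all assumptions made for illustration. The sketch relies on one identity: for two vectors in {-1, +1}^d, the dot product equals d minus twice the number of positions where the signs differ, which is a popcount of an XOR on the packed bits.

```python
import math

def pack_signs(vec):
    """Pack signs into an int: bit i is 1 when vec[i] >= 0 (zero counted as +1, an assumption)."""
    bits = 0
    for i, x in enumerate(vec):
        if x >= 0:
            bits |= 1 << i
    return bits

def sign_dot(a_bits, b_bits, dim):
    """Dot product of two {-1,+1}^dim vectors given their packed sign bits.

    Matching signs contribute +1 and mismatches -1, so the result is
    dim - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")
    return dim - 2 * mismatches

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def binary_attention(queries, keys, values, bias=None):
    """Attention with 1-bit queries and keys and full-precision values.

    bias is a hypothetical additive term, bias[i][j] for query i and key j,
    standing in for the learnable bias mentioned in the abstract.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)  # assumed scaling, as in standard attention
    key_bits = [pack_signs(k) for k in keys]
    outputs = []
    for i, q in enumerate(queries):
        q_bits = pack_signs(q)
        scores = [scale * sign_dot(q_bits, kb, dim) for kb in key_bits]
        if bias is not None:
            scores = [s + bias[i][j] for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs
```

In this sketch only the score computation is binarized; the softmax and the value aggregation stay in floating point. A real kernel would pack bits into machine words and use hardware popcount instructions, which is where a speedup of the kind reported in the abstract would come from.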

Related papers