Fugu-MT Paper Translation (Abstract): Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference

Paper summary: Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference

arxiv url: http://arxiv.org/abs/2604.02292v1
Date: Thu, 02 Apr 2026 17:32:29 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:10.972836
Title: Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
Title (reference translation): A Fast Softmax Surrogate for Integer-Native Edge Inference
Authors: Dimitrios Danopoulos, Enrico Lupi, Michael Kagan, Maurizio Pierini,
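Abstract summary: This paper proposes Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate for the exponential softmax function that uses a clipped linear mapping of the max-centered attention logits. The approximation produces a stable probability distribution, preserves the ordering of the original logits, and has no negative values. The paper describes a hardware-motivated implementation of HCCS for high-throughput scenarios targeting the AMD Versal AI Engines.
Reference score (independently computed attention level): 0.8488076117647583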
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Softmax can become a computational bottleneck in the Transformer model's Multi-Head Attention (MHA) block, particularly in small models under low-precision inference, where exponentiation and normalization incur significant overhead. As such, we suggest using Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate to the exponential softmax function, which uses a clipped linear mapping of the max-centered attention logits. This approximation produces a stable probability distribution, maintains the ordering of the original logits, and has non-negative values. HCCS differs from previous softmax surrogates as it includes a set of lightweight calibration parameters that are optimized offline based on a representative dataset and calibrated for each individual attention head to preserve the statistical properties of the individual heads. We describe a hardware-motivated implementation of HCCS for high-throughput scenarios targeting the AMD Versal AI Engines. The current reference implementations from AMD for this platform rely upon either bfloat16 arithmetic or LUTs to perform the exponential operation, which might limit the throughput of the platform and fail to utilize the high-throughput integer vector processing units of the AI Engine. In contrast, HCCS provides a natural mapping to the AI Engines' int8 multiply-accumulate (MAC) units. To the best of our knowledge, this is the first int8-optimized softmax surrogate for AMD AI Engines that significantly exceeds the speed performance of other reference implementations while maintaining competitive task accuracy on small or heavily quantized MHA workloads after quantization-aware retraining.
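Abstract (reference translation): Softmax can become a computational bottleneck in the Multi-Head Attention (MHA) block of Transformer models, especially in small models under low-precision inference, where exponentiation and normalization cause considerable overhead. We therefore propose Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate for the exponential softmax function that applies a clipped linear mapping to the max-centered attention logits. This approximation yields a stable probability distribution, preserves the ordering of the original logits, and takes no negative values. Unlike earlier softmax surrogates, HCCS includes a set of lightweight calibration parameters that are optimized offline on a representative dataset and calibrated separately for each attention head, so that the statistical properties of the individual heads are preserved. We describe a hardware-motivated implementation of HCCS for high-throughput scenarios targeting the AMD Versal AI Engines. AMD's current reference implementations for this platform rely on either bfloat16 arithmetic or LUTs to compute the exponential, which may limit the platform's throughput and leave the AI Engine's high-throughput integer vector processing units unused. HCCS, by contrast, maps naturally onto the AI Engines' int8 multiply-accumulate (MAC) units. To the best of our knowledge, this is the first int8-optimized softmax surrogate for AMD AI Engines that is significantly faster than the other reference implementations while maintaining competitive task accuracy on small or heavily quantized MHA workloads after quantization-aware retraining.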
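The abstract does not give the HCCS formula, so the following Python is only a minimal sketch of the general idea it describes: max-center the logits, apply a clipped linear map, then normalize. The function names, the `1 + slope * d` ramp, and the `slope`, `floor`, `slope_q`, `shift`, and `one_q` parameters are assumptions made for illustration and are not taken from the paper. The integer variant only hints at how such a map could reduce to integer multiplies and shifts; it is not the AI Engine implementation.

```python
from typing import List


def clipped_linear_softmax(logits: List[float], slope: float, floor: float = 0.0) -> List[float]:
    """Softmax surrogate sketch: clipped linear map of max-centered logits, normalized.

    `slope` and `floor` stand in for per-head calibration parameters
    (hypothetical names; the paper's actual parameterization may differ).
    """
    m = max(logits)
    # Max-centering: the largest logit maps to 0, every other entry is <= 0.
    centered = [x - m for x in logits]
    # Linear ramp 1 + slope * d, clipped to [floor, 1]; the max element scores 1.
    scores = [min(1.0, max(floor, 1.0 + slope * d)) for d in centered]
    total = sum(scores)  # at least 1, so the division is always safe
    return [s / total for s in scores]


def clipped_linear_scores_int(logits_q: List[int], slope_q: int, shift: int, one_q: int = 127) -> List[int]:
    """Integer-only variant of the unnormalized scores (assumed fixed-point layout).

    The slope is represented as slope_q / 2**shift and the value 1.0 as `one_q`.
    """
    m = max(logits_q)
    # One integer multiply, one arithmetic right shift, one clip at zero per element.
    return [max(0, one_q + ((slope_q * (x - m)) >> shift)) for x in logits_q]


probs = clipped_linear_softmax([2.0, 1.0, -3.0], slope=0.5)
scores_q = clipped_linear_scores_int([10, 6, -100], slope_q=16, shift=2)
```

With `slope=0.5`, the logits `[2.0, 1.0, -3.0]` get unnormalized scores 1, 0.5, and 0, so the output is non-negative, sums to 1, and keeps the ordering of the inputs. Logits far below the maximum are clipped to exactly zero rather than to a small positive value, which is one way such a surrogate differs from the exponential softmax.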

Related papers