Fugu-MT Paper Translation (Abstract): EQuARX: Efficient Quantized AllReduce in XLA for Distributed Machine Learning Acceleration

Paper overview: EQuARX: Efficient Quantized AllReduce in XLA for Distributed Machine Learning Acceleration

arxiv url: http://arxiv.org/abs/2506.17615v1
Date: Sat, 21 Jun 2025 06:54:52 GMT
Status: Translation complete
Last updated in system: 2025-06-24 19:06:36.505256
Title: EQuARX: Efficient Quantized AllReduce in XLA for Distributed Machine Learning Acceleration
Title (reference translation): EQuARX: Efficient Quantized AllReduce in XLA for Distributed Machine Learning Acceleration
Authors: Ibrahim Ahmed, Clemens Schaefer, Gil Tabak, Denis Vnukov, Zenong Zhang, Felix Chern, Anatoliy Yevtushenko, Andy Davis
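Abstract summary: The paper presents EQuARX, a native dynamic block-wise quantized AllReduce within the XLA compiler for TPUs. By using TPU-friendly quantization and deep pipelining of communication and compute, EQuARX with int8 precision achieves a 1.8X speedup over baseline BF16 AllReduce.
Reference score (independently computed attention score): 3.757632817011334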
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Large Language Models (LLMs) have become highly influential, their enormous scale presents significant deployment challenges. Efficiently serving these models typically requires distributing them across numerous accelerator devices, which introduces substantial performance overhead from inter-device communication (collectives). While model quantization has been widely adopted to reduce the memory and compute requirements of LLM weights and activations with minimal quality impact, applying quantization directly to collectives like AllReduce is inherently difficult due to the inter-device summation involved, which can lead to numerical instability or significant error accumulation. In this work, we present a native dynamic block-wise efficient quantized AllReduce within the XLA compiler for TPUs (EQuARX). By using TPU-friendly quantization and deep pipelining of communication and compute, EQuARX with int8 precision achieves a 1.8X speedup over baseline BF16 AllReduce across various network topologies. Furthermore, EQuARX accelerates the prefill stage of Gemma 3 27B by 1.25X and Gemma 3 12B by 1.1X, respectively, with small to negligible impact on quality.
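Abstract (reference translation): Large Language Models (LLMs) have become highly influential, but their enormous scale poses significant deployment challenges. Serving these models efficiently usually means distributing them across many accelerator devices, which adds substantial performance overhead from inter-device communication (collectives). Model quantization is widely used to reduce the memory and compute requirements of LLM weights and activations with minimal impact on quality, but applying quantization directly to collectives such as AllReduce is inherently difficult because of the inter-device summation involved, which can lead to numerical instability or significant error accumulation. This work presents a native dynamic block-wise efficient quantized AllReduce within the XLA compiler for TPUs (EQuARX). Through TPU-friendly quantization and deep pipelining of communication and compute, EQuARX with int8 precision achieves a 1.8X speedup over baseline BF16 AllReduce across various network topologies. In addition, EQuARX accelerates the prefill stage of Gemma 3 27B by 1.25X and of Gemma 3 12B by 1.1X, with small to negligible impact on quality.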
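The abstract names dynamic block-wise int8 quantization applied to AllReduce but does not spell out the algorithm. The following is a minimal illustrative sketch of the general idea only, not the EQuARX implementation: each simulated device quantizes its tensor to int8 with one scale per block, derived from the data at run time, and the receiver dequantizes and sums in floating point. The block size, the function names, and the single-step "gather then sum" structure are all assumptions made for illustration. A real AllReduce moves data over several hops, and nothing here models XLA, TPUs, or pipelining.

```python
from typing import List, Tuple

BLOCK = 4  # hypothetical block size, kept small for illustration


def quantize_blockwise(x: List[float], block: int = BLOCK) -> Tuple[List[int], List[float]]:
    """Symmetric int8 quantization with one scale per block.

    "Dynamic" here means the scale is computed from the data itself
    (max absolute value in the block) rather than fixed ahead of time.
    """
    q: List[int] = []
    scales: List[float] = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0  # avoid division by zero
        scales.append(scale)
        # Round to nearest integer and clamp to the symmetric int8 range.
        q.extend(max(-127, min(127, round(v / scale))) for v in chunk)
    return q, scales


def dequantize_blockwise(q: List[int], scales: List[float], block: int = BLOCK) -> List[float]:
    """Map int8 codes back to floats using the scale of the block they belong to."""
    return [q[i] * scales[i // block] for i in range(len(q))]


def simulated_quantized_allreduce(shards: List[List[float]], block: int = BLOCK) -> List[float]:
    """Sum per-device tensors after an int8 round trip (simplified, single step).

    Each entry of `shards` stands in for one device's local tensor. Only the
    int8 codes and per-block scales would cross the network; accumulation is
    done in floating point after dequantization.
    """
    total = [0.0] * len(shards[0])
    for shard in shards:
        q, scales = quantize_blockwise(shard, block)
        deq = dequantize_blockwise(q, scales, block)
        total = [t + d for t, d in zip(total, deq)]
    return total


if __name__ == "__main__":
    shards = [
        [0.5, -1.25, 3.0, 0.0, 10.0, -7.5, 2.0, 1.0],
        [1.5, 0.25, -3.0, 4.0, -10.0, 2.5, 0.0, 6.0],
    ]
    exact = [a + b for a, b in zip(*shards)]
    approx = simulated_quantized_allreduce(shards)
    print("max abs error:", max(abs(a - e) for a, e in zip(approx, exact)))
```

In this sketch each device contributes at most half a quantization step of error per element, so the error in the sum grows with the number of devices and with each block's largest magnitude. That is one simple way to see why the abstract describes summation across devices as a source of error accumulation.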


Related papers