Fugu-MT paper translation (abstract): ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces

Paper overview: ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces

arxiv url: http://arxiv.org/abs/2510.11168v1
Date: Mon, 13 Oct 2025 08:59:13 GMT
Status: translation complete
Last updated in system: 2025-10-14 18:06:30.280952
Title: ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces
Title (reference translation): ELMO: Efficiency through low precision and peak memory optimization in large output spaces
Authors: Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar
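Abstract summary: We propose a low-precision training framework for extreme multilabel classification. Combining low-precision training with the proposed memory optimizations enables significant reductions in GPU memory usage. For example, a 3-million-label XMC model is trained using only 6.6 GiB of GPU memory, compared with the 39.7 GiB required by the optimized SOTA method.
Reference score (independently computed attention score): 13.242009624334996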
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large output spaces, also referred to as extreme multilabel classification (XMC), are a setting that arises, e.g., in large-scale tagging and product-to-product recommendation, and is characterized by the number of labels ranging from hundreds of thousands to millions. This means that the linear classification head, usually only a tiny fraction of the overall model, turns into the main driver for compute and memory demand. Current state-of-the-art XMC methods predominantly rely on FP16-FP32 mixed-precision training, which we show can be unstable and inefficient in terms of memory usage and computational overhead. Meanwhile, existing low-precision methods typically retain higher precision for the classification layer. In this work, we propose ELMO, a pure low-precision training framework for XMC models using BFloat16 and Float8 data types. By leveraging Kahan summation and stochastic rounding, we demonstrate that XMC models can be effectively trained entirely in Float8, without relying on single-precision master weights or tensor scaling. Low-precision training, combined with our proposed memory optimizations (gradient fusion and chunking), enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of GPU memory, compared to the 39.7 GiB required by the optimized SOTA method, Renee, without compromising accuracy.
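Abstract (reference translation): Large output spaces, also called extreme multilabel classification (XMC), are a setting that arises in large-scale tagging and product-to-product recommendation and is characterized by hundreds of thousands to millions of labels. This means that the linear classification head, usually a tiny fraction of the overall model, becomes the main driver of compute and memory demand. Current state-of-the-art XMC methods rely mainly on FP16-FP32 mixed-precision training, which we show can be unstable and inefficient in memory usage and computational overhead. Meanwhile, existing low-precision methods generally keep higher precision for the classification layer. In this paper, we propose ELMO, a low-precision training framework for XMC models that uses the BFloat16 and Float8 data types. Using Kahan summation and stochastic rounding, we demonstrate that XMC models can be trained entirely in Float8 without relying on single-precision master weights or tensor scaling. Combining low-precision training with the proposed memory optimizations (gradient fusion and chunking) enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model using only 6.6 GiB of GPU memory, compared with the 39.7 GiB required by the optimized SOTA method.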
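The abstract names Kahan summation and stochastic rounding as the tools that make training without single-precision master weights possible. The sketch below is not the paper's implementation. It is a minimal, pure-Python illustration of those two general techniques applied to a single scalar weight, with bfloat16 emulated by keeping only the upper 16 bits of a float32 bit pattern. The helper names (`to_bf16`, `to_bf16_stochastic`, `kahan_step`, `run_demo`) and the toy setting (start at 1.0, add 0.001 a thousand times) are assumptions made for illustration, and Float8 is not emulated.

```python
import random
import struct


def _f32_bits(x: float) -> int:
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_f32(bits: int) -> float:
    """Interpret a 32-bit unsigned int as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def to_bf16(x: float) -> float:
    """Emulate bfloat16 with round-to-nearest-even (keep the top 16 bits)."""
    bits = _f32_bits(x)
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    return _bits_f32(bits & 0xFFFF0000)


def to_bf16_stochastic(x: float, rng: random.Random) -> float:
    """Emulate bfloat16 with stochastic rounding.

    Adding a uniform 16-bit integer before truncation rounds the magnitude
    up with probability equal to the discarded fraction, so the result is
    unbiased in expectation.
    """
    bits = _f32_bits(x) + rng.getrandbits(16)
    return _bits_f32(bits & 0xFFFF0000)


def kahan_step(w: float, comp: float, delta: float):
    """One Kahan-compensated update; every intermediate is rounded to bf16."""
    y = to_bf16(delta - comp)           # update corrected by the stored error
    t = to_bf16(w + y)                  # new weight, low bits may be lost
    comp = to_bf16(to_bf16(t - w) - y)  # what was lost (negated), kept for next step
    return t, comp


def run_demo(steps: int = 1000, delta: float = 1e-3, seed: int = 0):
    """Apply `steps` updates of size `delta` to a weight that starts at 1.0."""
    rng = random.Random(seed)
    naive = kahan = stoch = 1.0
    comp = 0.0
    for _ in range(steps):
        naive = to_bf16(naive + delta)
        kahan, comp = kahan_step(kahan, comp, delta)
        stoch = to_bf16_stochastic(stoch + delta, rng)
    return naive, kahan, stoch


if __name__ == "__main__":
    naive, kahan, stoch = run_demo()
    print("exact result      :", 2.0)
    print("naive bf16        :", naive)
    print("Kahan bf16        :", kahan)
    print("stochastic rounding:", stoch)
```

In this toy setting the naive bfloat16 weight never moves, because 0.001 is smaller than half the bfloat16 spacing near 1.0 (2^-7 = 0.0078125). The Kahan variant carries the lost low-order part in a second bfloat16 value, and the stochastic variant recovers the update on average at the cost of some noise. Both end up close to the exact result of 2.0 without any float32 copy of the weight.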


Related papers