Fugu-MT Paper Translation (Abstract): LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

Paper abstract: LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

arxiv url: http://arxiv.org/abs/2605.10886v2
Date: Wed, 13 May 2026 20:59:46 GMT
Status: Translation complete
System update date: 2026-05-15 15:19:49.885439
Title: LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
Title (reference translation): LoKA: Low-Precision Kernel Applications for Large-Scale Recommendation Models
Authors: Liang Luo, Yinbin Ma, Quanyu Zhu, Vasiliy Kuznetsov, Yuxin Chen, Jian Jiao, Jiecao Yu, Buyun Zhang, Tongyi Tang, Xiaohan Wei, Yanli Zhao, Zeliang Chen, Yuchen Hao, Venkatesh Ranganathan, Sandeep Parab, Yantao Yao, Maxim Naumov, Chunzhi Yang, Shen Li, Ellie Wen, Wenlin Chen, Santanu Kolay, Chunqiang Tang
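Abstract summary: We propose LoKA, a framework that makes FP8 practical for large recommendation models (LRMs). LoKA Probe is a statistically grounded online benchmarking method that learns activation and weight statistics and quantifies per-layer errors. LoKA Dispatch is a runtime that uses the statistics from LoKA Probe to select the fastest FP8 kernel.
Reference score (independently computed attention score): 19.273840159657983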
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.
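Abstract (reference translation): Recent GPU generations deliver substantially higher FLOPs through low-precision arithmetic such as FP8. Although FP8 has been applied successfully to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, are dominated by small matrix multiplications (GEMMs) followed by normalization, and are trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and lengthens training time. These challenges are inherent to LRM workloads and cannot be solved by better FP8 kernels alone. Instead, a system-model co-design approach is needed to integrate FP8 successfully. The proposed LoKA (Low-precision Kernel Applications) is a framework that makes FP8 practical for LRMs through three principles: profiling under realistic distributions to learn where low precision is safe, co-designing model components with hardware to expand where it is safe, and orchestrating across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded online benchmarking method that learns activation and weight statistics and quantifies per-layer errors. This process identifies which sites are safe or unsafe, and fast or slow, for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that uses the statistical insights from LoKA Probe to select the fastest FP8 kernel that meets the accuracy requirements.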
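
The selection rule the abstract attributes to LoKA Dispatch (pick the fastest kernel that satisfies the accuracy requirements) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `KernelProfile` type, the `select_kernel` function, the kernel names, the latency and error figures, and the high-precision fallback are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class KernelProfile:
    """Hypothetical per-layer measurement, as a probing step might record it."""
    name: str          # illustrative kernel identifier, not a real library name
    latency_us: float  # measured latency for this layer's shapes
    rel_error: float   # relative error against a high-precision reference


def select_kernel(profiles: Iterable[KernelProfile],
                  max_rel_error: float,
                  fallback: str = "bf16_reference") -> str:
    """Return the fastest profiled kernel whose error fits the budget.

    If no candidate meets the budget, fall back to a high-precision kernel
    (an assumption of this sketch; the abstract does not describe a fallback).
    """
    eligible = [p for p in profiles if p.rel_error <= max_rel_error]
    if not eligible:
        return fallback
    return min(eligible, key=lambda p: p.latency_us).name


# Made-up example measurements for a single layer.
profiles = [
    KernelProfile("fp8_tensorwise", latency_us=10.0, rel_error=0.05),
    KernelProfile("fp8_rowwise", latency_us=14.0, rel_error=0.01),
    KernelProfile("fp8_blockwise", latency_us=20.0, rel_error=0.004),
]

choice = select_kernel(profiles, max_rel_error=0.02)
```

With these made-up numbers, a 2% error budget excludes the fastest candidate and selects the next-fastest one that fits; tightening the budget below every candidate's error returns the fallback. Filtering on accuracy first and minimizing latency second keeps the accuracy requirement a hard constraint instead of a trade-off.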


Related papers