Fugu-MT Paper Translation (Abstract): TurboGR: An Accelerated Training System for Large-Scale Generative Recommendation

Paper overview: TurboGR: An Accelerated Training System for Large-Scale Generative Recommendation

arxiv url: http://arxiv.org/abs/2605.13433v1
Date: Wed, 13 May 2026 12:26:29 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:28.039735
Title: TurboGR: An Accelerated Training System for Large-Scale Generative Recommendation
Authors: Huichao Chai, Zhixin Wu, Xuemiao Li, Shiqing Fan, Hengfeng Wang, Maojun Peng, Lu Xu, Yaoyuan Wang, Yibo Jin, Wei Guo, Yongxiang Feng
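Abstract summary: Generative recommendation (GR) has emerged as a promising paradigm that replaces fragmented, scenario-specific architectures with unified Transformer-based models. Deploying GR at scale on Ascend NPUs faces fundamental system-level challenges. We present TurboGR, an Ascend-affinity training system for generative recommendation.
Reference score (independently computed attention score): 9.645364292862624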
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generative recommendation (GR) has emerged as a promising paradigm that replaces fragmented, scenario-specific architectures with unified Transformer-based models, exhibiting scaling-law behavior where recommendation quality improves systematically with increased model capacity and training data. However, deploying GR at scale on Ascend NPUs faces fundamental system-level challenges. These challenges are further exacerbated on Ascend NPUs due to the absence of high-performance implementations for jagged operators and the architectural mismatch between irregular sparse primitives and NPU's dense-computation-optimized design. In this paper, we present TurboGR, an Ascend-affinity training system for generative recommendation that systematically addresses these bottlenecks through three core innovations: (i) Ascend-affinity jagged acceleration, including fusion operators that eliminate padding redundancy and dynamic load balancing that reduces inter-device imbalance from 47% to 2.4%; (ii) distributed communication optimization, comprising hierarchical sparse parallelism, semi-asynchronous training with proven convergence guarantees, and fine-grained pipeline orchestration that sustains 94% NPU utilization; and (iii) negative sampling optimization via asynchronous offloading, jaggedness-aware FP16 quantization, and intra-batch logit sharing that expand the effective negative space without additional embedding lookups. Evaluated on the KuaiRand-27K dataset, TurboGR supports training at up to 0.2B parameters and achieves 54.71% MFU with near-linear scalability (0.97).
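
The two ideas named in innovation (i), padding redundancy and inter-device load imbalance, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the sample sequence lengths, the longest-first greedy heuristic, and the assumption that a device's load is proportional to the total sequence length assigned to it are all hypothetical choices made for illustration.

```python
import heapq

def padding_waste(lengths):
    """Fraction of a padded [batch, max_len] layout that holds padding."""
    padded_slots = len(lengths) * max(lengths)
    return 1.0 - sum(lengths) / padded_slots

def imbalance(loads):
    """Relative gap between the busiest device and the mean load."""
    mean = sum(loads) / len(loads)
    return (max(loads) - mean) / mean

def greedy_balance(lengths, num_devices):
    """Longest-first greedy: give each sequence to the least-loaded device.

    Assumption: load is proportional to sequence length. Real attention
    cost grows faster than linearly, so this is only a rough proxy.
    """
    heap = [(0, dev) for dev in range(num_devices)]  # (current load, device id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_devices)]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        load, dev = heapq.heappop(heap)
        assignment[dev].append(idx)
        heapq.heappush(heap, (load + lengths[idx], dev))
    return assignment

# Hypothetical jagged batch: user histories of very different lengths.
lengths = [512, 480, 64, 32, 300, 290, 16, 8]

# Naive split: first half of the batch to device 0, second half to device 1.
naive_loads = [sum(lengths[:4]), sum(lengths[4:])]

parts = greedy_balance(lengths, num_devices=2)
balanced_loads = [sum(lengths[i] for i in part) for part in parts]

print("padding waste:", padding_waste(lengths))
print("naive imbalance:", imbalance(naive_loads))
print("balanced imbalance:", imbalance(balanced_loads))
```

A jagged (unpadded) layout stores only the real tokens, so the padding fraction computed here is the work a padded kernel would spend on empty slots; the greedy pass shows how reassigning sequences by length can shrink the gap between devices.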

Related papers