Fugu-MT paper translation (abstract): Faster and Memory-Efficient Training of Sequential Recommendation Models for Large Catalogs

Paper overview: Faster and Memory-Efficient Training of Sequential Recommendation Models for Large Catalogs

arxiv url: http://arxiv.org/abs/2509.09682v2
Date: Fri, 24 Oct 2025 16:19:07 GMT
Status: translation complete
Last updated in system: 2025-10-28 06:57:23.323314
Title: Faster and Memory-Efficient Training of Sequential Recommendation Models for Large Catalogs
Title (reference translation): Speeding Up and Improving the Memory Efficiency of Sequential Recommendation Models for Large Catalogs
Authors: Maxim Zhelnin, Dmitry Redko, Volkov Daniil, Anna Volodkevich, Petr Sokerin, Valeriy Shevchenko, Egor Shvetsov, Alexey Vasilev, Darya Denisova, Ruslan Izmailov, Alexey Zaytsev
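Abstract summary: We propose the CCE- method, a GPU-efficient implementation of the cross-entropy loss with negative sampling. The method speeds up training by up to 2x while reducing memory consumption by more than 10x.
Reference score (independently computed attention measure): 3.0832329178398967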
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sequential recommendation (SR) models with transformer-based architectures are widely adopted in real-world applications, where they require frequent retraining to adapt to ever-changing user preferences. However, training transformer-based SR models frequently incurs a high computational cost from scoring extensive item catalogs, often exceeding thousands of items. This cost stems mainly from the cross-entropy loss, whose peak memory scales proportionally to catalog size, batch size, and sequence length. Practitioners in recommender systems therefore typically reduce memory consumption by combining the cross-entropy (CE) loss with negative sampling, which lowers the explicit memory demands of the final layer. However, a small number of negative samples degrades model performance, and, as we demonstrate in our work, increasing the number of negative samples and the batch size further improves the model's performance but rapidly starts to exceed the memory of industrial GPUs (~40 GB). In this work, we introduce the CCE- method, which offers a GPU-efficient implementation of the CE loss with negative sampling. Our method accelerates training by up to two times while reducing memory consumption by more than 10 times. Leveraging the memory savings afforded by CCE-, it becomes feasible to train models that are more accurate on datasets with a large item catalog than models trained with the original PyTorch-implemented loss functions. Finally, we analyze key memory-related hyperparameters and highlight the need for a delicate balance among them. We demonstrate that scaling both the number of negative samples and the batch size leads to better results than maximizing only one of them. To facilitate further adoption of CCE-, we release a Triton kernel that efficiently implements the proposed method.
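The scaling claim in the abstract can be made concrete with a back-of-the-envelope estimate. The sketch below counts only the bytes of a single logits tensor of shape (batch size, sequence length, number of scored items) in 32-bit floats; gradients, activations, and optimizer state are ignored. The function name `logits_bytes` and all sizes are hypothetical values chosen for illustration, not figures from the paper.

```python
GIB = 1024 ** 3


def logits_bytes(batch_size, seq_len, num_scores, bytes_per_elem=4):
    """Bytes for one logits tensor of shape (batch_size, seq_len, num_scores)."""
    return batch_size * seq_len * num_scores * bytes_per_elem


# Hypothetical sizes, for illustration only (not taken from the paper).
batch_size, seq_len = 256, 200
catalog_size = 1_000_000
num_negatives = 512

# Full cross-entropy scores every catalog item at every sequence position.
full = logits_bytes(batch_size, seq_len, catalog_size)

# Negative sampling scores only the positive item plus the sampled negatives.
sampled = logits_bytes(batch_size, seq_len, num_negatives + 1)

print(f"full CE logits:    {full / GIB:.1f} GiB")
print(f"sampled CE logits: {sampled / GIB:.3f} GiB")
```

Under these assumed sizes the full logits tensor alone is far larger than a ~40 GB GPU, while the sampled variant grows with the number of negatives and the batch size rather than with the catalog size.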
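Abstract (reference translation): Sequential recommendation (SR) with transformer-based architectures is widely adopted in real-world applications. However, training transformer-based SR models often faces a high computational cost when scoring extensive item catalogs. This is mainly due to the use of the cross-entropy loss, whose peak memory scales in proportion to catalog size, batch size, and sequence length. Recognizing this, practitioners of recommender systems generally address memory consumption by combining the cross-entropy (CE) loss with negative sampling, which reduces the explicit memory demands of the final layer. However, a small number of negative samples degrades model performance, and, as our work shows, increasing the number of negative samples and the batch size further improves model performance but quickly begins to exceed the memory of industrial GPUs (about 40 GB). This paper proposes the CCE- method, a GPU-efficient implementation of the CE loss with negative sampling. The method speeds up training by up to 2x while reducing memory consumption by more than 10x. Exploiting the memory savings obtained by training with CCE- makes it possible to improve accuracy on datasets with large item catalogs compared with models trained with the original PyTorch-implemented loss functions. Finally, we analyze key memory-related hyperparameters and emphasize the need for a delicate balance among these factors. We show that scaling both the number of negative samples and the batch size gives better results than maximizing only one of them. To facilitate further adoption of CCE-, we release a Triton kernel that efficiently implements the proposed method.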
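For readers unfamiliar with the loss the abstract refers to, the following is a minimal pure-Python sketch of cross-entropy with negative sampling for a single sequence position. It is not the CCE- Triton kernel and not the authors' implementation; the helper names `sampled_ce_loss` and `sample_negatives`, and the choice of uniform sampling, are assumptions made for illustration.

```python
import math
import random


def sample_negatives(catalog_size, positive, k, rng):
    """Uniformly draw k distinct item ids, excluding the positive item.

    Uniform sampling is an assumption here, not the paper's stated strategy.
    """
    pool = [i for i in range(catalog_size) if i != positive]
    return rng.sample(pool, k)


def sampled_ce_loss(hidden, item_emb, positive, negatives):
    """Cross-entropy over the positive item and the sampled negatives only."""
    candidates = [positive] + list(negatives)
    # Score each candidate as the dot product of hidden state and item embedding.
    logits = [sum(h * e for h, e in zip(hidden, item_emb[i])) for i in candidates]
    # Subtract the max logit for a numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of the positive item (index 0 in candidates).
    return log_z - logits[0]


rng = random.Random(0)
catalog_size, dim = 100, 8
item_emb = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(catalog_size)]
hidden = [rng.gauss(0.0, 1.0) for _ in range(dim)]

positive = 7
negatives = sample_negatives(catalog_size, positive, k=16, rng=rng)
loss = sampled_ce_loss(hidden, item_emb, positive, negatives)
print(f"sampled CE loss with {len(negatives)} negatives: {loss:.4f}")
```

Only `len(negatives) + 1` scores are computed per position instead of one per catalog item, which is where the memory reduction of negative sampling comes from. Passing every non-positive item as a negative recovers the full cross-entropy loss.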
