Fugu-MT Paper Translation (Abstract): TokenMixer-Large: Scaling Up Large Ranking Models in Industrial Recommenders

Paper overview: TokenMixer-Large: Scaling Up Large Ranking Models in Industrial Recommenders

arxiv url: http://arxiv.org/abs/2602.06563v1
Date: Fri, 06 Feb 2026 10:04:33 GMT
Status: Translation complete
Last updated in system: 2026-02-09 22:18:26.343669
Title: TokenMixer-Large: Scaling Up Large Ranking Models in Industrial Recommenders
Title (reference translation): TokenMixer-Large: Scaling Up Large Ranking Models in Industry Recommendation
Authors: Yuchen Jiang, Jie Zhu, Xintian Han, Hui Lu, Kunmin Bai, Mingyu Yang, Shikang Wu, Ruihao Zhang, Wenlin Zhao, Shipeng Bai, Sijin Zhou, Huizhi Yang, Tianyi Liu, Wenda Liu, Ziyan Gong, Haoran Ding, Zheng Chai, Deping Xie, Zhe Chen, Yuchao Zheng, Peng Xu
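Abstract summary: TokenMixer-Large is a new architecture for scaling up large recommendation models. It addresses sub-optimal residual design, insufficient gradient updates in deep models, incomplete MoE sparsification, and limited exploration of scalability. It successfully scales its parameters to 7 billion on online traffic and 15 billion in offline experiments.
Reference score (independently computed attention measure): 28.610671210049247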
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, the study of scaling laws for large recommendation models has gradually gained attention. Works such as Wukong, HiFormer, and DHEN have attempted to increase the complexity of interaction structures in ranking models and validate scaling laws between performance and parameters/FLOPs by stacking multiple layers. However, their experimental scale remains relatively limited. Our previous work introduced the TokenMixer architecture, an efficient variant of the standard Transformer where the self-attention mechanism is replaced by a simple reshape operation, and the feed-forward network is adapted to a per-token FFN. The effectiveness of this architecture was demonstrated in the ranking stage by the model presented in the RankMixer paper. However, this foundational TokenMixer architecture itself has several design limitations. In this paper, we propose TokenMixer-Large, which systematically addresses these core issues: sub-optimal residual design, insufficient gradient updates in deep models, incomplete MoE sparsification, and limited exploration of scalability. By leveraging a mixing-and-reverting operation, inter-layer residuals, an auxiliary loss, and a novel Sparse-Pertoken MoE architecture, TokenMixer-Large successfully scales its parameters to 7 billion and 15 billion on online traffic and offline experiments, respectively. Currently deployed in multiple scenarios at ByteDance, TokenMixer-Large has achieved significant offline and online performance gains.
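Abstract (reference translation): In recent years, research on scaling laws for large recommendation models has gradually attracted attention. Works such as Wukong, HiFormer, and DHEN have tried to increase the complexity of interaction structures in ranking models and to validate scaling laws between performance and parameters/FLOPs by stacking multiple layers. However, their experimental scale remains relatively limited. Our previous work introduced the TokenMixer architecture, which replaces the self-attention mechanism with a simple reshape operation and adapts the feed-forward network to a per-token FFN. The effectiveness of this architecture was demonstrated in the ranking stage by the model presented in the RankMixer paper. However, the foundational TokenMixer architecture itself has several design limitations. This paper proposes TokenMixer-Large, which systematically addresses these issues. By leveraging a mixing-and-reverting operation, inter-layer residuals, an auxiliary loss, and a new Sparse-Pertoken MoE architecture, TokenMixer-Large scales its parameters to 7 billion on online traffic and 15 billion in offline experiments. TokenMixer-Large is currently deployed in multiple scenarios at ByteDance.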
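
The abstract describes TokenMixer as replacing self-attention with a simple reshape operation and using a per-token FFN. The sketch below is one possible reading of those two ideas in plain Python, not the authors' implementation. The function names (`mix_tokens`, `per_token_ffn`), the head-splitting scheme, and the ReLU activation are all assumptions made for illustration; the abstract does not specify them.

```python
def mix_tokens(tokens, num_heads):
    """Parameter-free mixing (assumed scheme): split every token into
    `num_heads` equal slices, then build output token h by concatenating
    slice h of every input token."""
    dim = len(tokens[0])
    assert dim % num_heads == 0, "token dim must be divisible by num_heads"
    head_dim = dim // num_heads
    mixed = []
    for h in range(num_heads):
        new_token = []
        for tok in tokens:
            new_token.extend(tok[h * head_dim:(h + 1) * head_dim])
        mixed.append(new_token)
    return mixed


def matvec(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    out_dim = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(out_dim)]


def per_token_ffn(tokens, w_in, w_out):
    """Apply a separate two-layer FFN to each token position.
    w_in[i] and w_out[i] are the weights for token i only.
    ReLU is an assumed activation."""
    out = []
    for tok, w1, w2 in zip(tokens, w_in, w_out):
        hidden = [max(0.0, x) for x in matvec(tok, w1)]
        out.append(matvec(hidden, w2))
    return out


tokens = [[1, 2, 3, 4], [5, 6, 7, 8]]       # 2 tokens of dimension 4
mixed = mix_tokens(tokens, num_heads=2)      # each output token sees both inputs
restored = mix_tokens(mixed, num_heads=len(tokens))  # same reshape undoes the mixing
```

Under this assumed scheme, applying the same reshape with the head count set to the original number of tokens restores the input, which is one simple way a mixing step could be reverted. Whether this matches the paper's mixing-and-reverting operation is not stated in the abstract.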

Related papers