Fugu-MT paper translation (abstract): JTok: On Token Embedding as another Axis of Scaling Law via Joint Token Self-modulation

Paper abstract: JTok: On Token Embedding as another Axis of Scaling Law via Joint Token Self-modulation

arxiv url: http://arxiv.org/abs/2602.00800v1
Date: Sat, 31 Jan 2026 16:15:18 GMT
Status: translation complete
Last updated in system: 2026-02-03 19:28:33.408431
Title: JTok: On Token Embedding as another Axis of Scaling Law via Joint Token Self-modulation
Title (reference translation): JTok: On Token Embedding as Another Axis of the Scaling Law via Joint Token Self-Modulation
Authors: Yebin Yang, Huaijin Wu, Fu Guo, Lin Yao, Xiaohan Qin, Jingzhi Wang, Debing Zhang, Junchi Yan
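Abstract summary: We introduce Joint-Token (JTok) and Mixture of Joint-Token (JTok-M), which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight element-wise operations, incurring negligible FLOPs overhead. Our approach consistently reduces validation loss and significantly improves downstream task performance.
Reference score (independently computed attention score): 46.64215658042213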
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLMs have traditionally scaled along dense dimensions, where performance is coupled with near-linear increases in computational cost. While MoE decouples capacity from compute, it introduces large memory overhead and hardware efficiency challenges. To overcome these, we propose token-indexed parameters as a novel, orthogonal scaling axis that decouples model capacity from FLOPs. Specifically, we introduce Joint-Token (JTok) and Mixture of Joint-Token (JTok-M), which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight, element-wise operations, incurring negligible FLOPs overhead. Extensive experiments on both dense and MoE backbones, spanning from 650M (190M + 460M embedding) to 61B (17B + 44B embedding) total parameters, demonstrate that our approach consistently reduces validation loss and significantly improves downstream task performance (e.g., +4.1 on MMLU, +8.3 on ARC, +8.9 on CEval). Rigorous isoFLOPs analysis further confirms that JTok-M fundamentally shifts the quality-compute Pareto frontier, achieving comparable model quality with 35% less compute relative to vanilla MoE architectures, and we validate that token-indexed parameters exhibit a predictable power-law scaling behavior. Moreover, our efficient implementation ensures that the overhead introduced by JTok and JTok-M remains marginal.
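Abstract (reference translation): LLMs have traditionally scaled along dense dimensions, where performance is tied to a near-linear increase in computational cost. MoE decouples capacity from compute, but it brings large memory overhead and hardware efficiency challenges. To overcome these, we propose token-indexed parameters as a new, orthogonal scaling axis that decouples model capacity from FLOPs. Specifically, we introduce Joint-Token (JTok) and Mixture of Joint-Token (JTok-M), which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight element-wise operations, incurring negligible FLOPs overhead. Extensive experiments on dense and MoE backbones, spanning 650M (190M + 460M embedding) to 61B (17B + 44B embedding) total parameters, show that our approach consistently reduces validation loss and significantly improves downstream task performance (+4.1 on MMLU, +8.3 on ARC, +8.9 on CEval). Rigorous isoFLOPs analysis confirms that JTok-M fundamentally shifts the quality-compute Pareto frontier, achieving comparable model quality with 35% less compute than vanilla MoE architectures. Moreover, our efficient implementation ensures that the overhead introduced by JTok and JTok-M remains marginal.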
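The abstract describes modulation vectors that are looked up by token id in an auxiliary embedding table and applied to the backbone element-wise. The sketch below illustrates that general idea in plain Python. It is not the authors' implementation: the multiplicative form `h * (1 + m[token])`, the zero-initialized table, and the names `make_table`, `modulate`, `extra_flops_per_token`, and `dense_ffn_flops_per_token` are all assumptions made for illustration, since the abstract does not give the exact formula.

```python
import random


def make_table(vocab_size, d_model, scale=0.0, seed=0):
    """Hypothetical auxiliary table: one modulation vector per token id.

    scale=0.0 gives an all-zero table (an assumed initialization, so that
    the modulation starts as the identity).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_model)]
            for _ in range(vocab_size)]


def modulate(hidden, token_ids, table):
    """Apply token-indexed, element-wise modulation to hidden states.

    hidden:    seq_len vectors, each of length d_model
    token_ids: seq_len token ids
    table:     vocab_size x d_model modulation table
    Assumed form (not taken from the paper): h * (1 + m[token]).
    """
    out = []
    for h, tok in zip(hidden, token_ids):
        m = table[tok]  # a lookup, not a matrix multiply
        out.append([h_i * (1.0 + m_i) for h_i, m_i in zip(h, m)])
    return out


def extra_flops_per_token(d_model):
    """One add and one multiply per hidden dimension."""
    return 2 * d_model


def dense_ffn_flops_per_token(d_model, d_ff):
    """Two matmuls (d_model x d_ff, d_ff x d_model), 2 FLOPs per multiply-add."""
    return 2 * 2 * d_model * d_ff


# Small demonstration with made-up sizes.
table = make_table(vocab_size=8, d_model=4)
hidden = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
tokens = [3, 5]

identity = modulate(hidden, tokens, table)  # zero table leaves hidden unchanged
table[3] = [1.0, 0.0, -1.0, 0.5]            # pretend token 3 learned a vector
changed = modulate(hidden, tokens, table)   # only the position holding token 3 moves

ratio = extra_flops_per_token(1024) / dense_ffn_flops_per_token(1024, 4096)
print(changed[0], ratio)
```

Under these assumptions the table adds `vocab_size * d_model` parameters per modulated layer, while the per-token cost stays at a lookup plus `2 * d_model` element-wise operations. That asymmetry is the sense in which capacity can grow without a matching growth in FLOPs.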


Related papers