Fugu-MT Paper Translation (Abstract): Scaling Laws for Behavioral Foundation Models over User Event Sequences

Paper summary: Scaling Laws for Behavioral Foundation Models over User Event Sequences

arxiv url: http://arxiv.org/abs/2606.05257v1
Date: Wed, 03 Jun 2026 15:59:25 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.289461
Title: Scaling Laws for Behavioral Foundation Models over User Event Sequences
Title (reference translation): Scaling Laws for Behavioral Foundation Models over User Event Sequences
Authors: Rickard Brüel Gabrielsson,
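Abstract summary: This paper studies a common two-part behavioral-model architecture: a feature-based event embedder and a decoder-only transformer. Across roughly 600 runs on real interaction data, spanning $10^{15}$-$10^{19}$ training FLOPs, it jointly varies four deployment-relevant axes. Compute-optimal training is data-heavy relative to text at low compute, but its $D/N$ ratio moves toward the Chinchilla heuristic as compute increases.
Reference score (independently computed attention score): 2.924581427482972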
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Foundation models are increasingly trained on sequences of user actions in recommendation, payments, fraud, and commerce, but these models still lack the kind of compute calibration that scaling laws provide for language models. We study a common two-part behavioral-model architecture: a feature-based event embedder maps each multi-modal item to a vector, and a decoder-only transformer predicts the next event from the resulting sequence. Across roughly 600 runs on real interaction data, spanning $10^{15}$-$10^{19}$ training FLOPs, we jointly vary four deployment-relevant axes: the two-part parameter split, critical batch size, model/data allocation, and the number of sampled negatives used after freezing the embedder. A small embedder ($s^{\star}\!\approx\!2\%$ of parameters) is compute-optimal at every budget we test because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters. Compute-optimal training is data-heavy relative to text at low compute, but its $D/N$ ratio moves toward the Chinchilla heuristic as compute increases. The sampled training objective and deployed ranking metrics disagree in ways that themselves scale: critical batch size, optimal negative count after freezing, and the agreement between loss and ranking quality all shift with compute and with the chosen evaluation metric. For negative sampling, larger budgets increasingly prefer more negatives; by $10^{19}$ FLOPs the active constraint is candidate-axis memory rather than FLOPs. In behavioral foundation models, the evaluation metric is therefore part of the scaling law: changing it can change the compute-optimal recipe.
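Abstract (reference translation): Foundation models are increasingly trained on sequences of user actions in recommendation, payments, fraud, and commerce, but these models still lack the compute calibration that scaling laws provide for language models. We study a common two-part behavioral-model architecture: a feature-based event embedder maps each multi-modal item to a vector, and a decoder-only transformer predicts the next event from the resulting sequence. Across roughly 600 runs on real interaction data, spanning $10^{15}$-$10^{19}$ training FLOPs, we jointly vary four deployment-relevant axes: the two-part parameter split, critical batch size, model/data allocation, and the number of sampled negatives used after freezing the embedder. A small embedder ($s^{\star} \approx 2\%$ of parameters) is compute-optimal at every budget we test, because embedder parameters are both more expensive per step and exposed to far more repeated items than contextualizer parameters. Compute-optimal training is data-heavy relative to text at low compute, but its $D/N$ ratio moves toward the Chinchilla heuristic as compute increases. The sampled training objective and deployed ranking metrics disagree in ways that themselves scale: critical batch size, optimal negative count after freezing, and the agreement between loss and ranking quality all shift with compute and with the chosen evaluation metric. For negative sampling, larger budgets increasingly prefer more negatives; by $10^{19}$ FLOPs the active constraint is candidate-axis memory rather than FLOPs. In behavioral foundation models, the evaluation metric is therefore part of the scaling law: changing it can change the compute-optimal recipe.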
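
For orientation on the $D/N$ ratio mentioned in the abstract, the following is a minimal sketch of the text-model baseline the paper compares against. It does not reproduce the paper's fitted law. It assumes the widely used approximation $C \approx 6ND$ for dense transformers and the Chinchilla rule of thumb of about 20 training tokens per parameter. Both constants, the function name `chinchilla_allocation`, and the treatment of events as tokens are assumptions introduced here for illustration. The abstract does not state the paper's own coefficients.

```python
import math

# Assumed constants (not taken from the paper):
TOKENS_PER_PARAM = 20.0       # Chinchilla rule of thumb for text models
FLOPS_PER_PARAM_TOKEN = 6.0   # C ~ 6 * N * D approximation for dense transformers


def chinchilla_allocation(compute_flops, ratio=TOKENS_PER_PARAM):
    """Split a FLOP budget into (N, D) such that D = ratio * N and 6 * N * D = C."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * ratio))
    n_tokens = ratio * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Sweep the budget range quoted in the abstract: 1e15 to 1e19 training FLOPs.
    for exponent in range(15, 20):
        budget = 10.0 ** exponent
        n, d = chinchilla_allocation(budget)
        print(f"C=1e{exponent}: N~{n:.3g} params, D~{d:.3g} tokens, D/N={d / n:.0f}")
```

Passing a larger `ratio` models a more data-heavy recipe: for the same budget, the parameter count shrinks and the data count grows. This is the direction the abstract describes for behavioral models at low compute, though the actual ratios are not given there.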
