Fugu-MT Paper Translation (Abstract): Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF

Paper abstract: Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF

arxiv url: http://arxiv.org/abs/2603.10279v1
Date: Tue, 10 Mar 2026 23:48:25 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:32.722327
Title: Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF
Title (reference translation): Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF
Authors: Keertana Chidambaram, Sanath Kumar Krishnamurthy, Qiuling Xu, Ko-Jen Hsiao, Moumita Bhattacharya
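Abstract summary: Existing post-training methods based on RLHF reward hack because of noisy user feedback and unreliable reward models. Exponential reward-weighted SFT with weights $w = \exp(r/λ)$ is uniquely suited to this setting. We prove the first policy improvement guarantees for this setting under noisy rewards.
Reference score (independently computed attention score): 7.2858507889096815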
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Aligning generative recommender systems to user preferences via post-training is critical for closing the gap between next-item prediction and actual recommendation quality. Existing post-training methods are ill-suited for production-scale systems: RLHF methods reward hack due to noisy user feedback and unreliable reward models, offline RL alternatives require propensity scores that are unavailable, and online interaction is infeasible. We identify exponential reward-weighted SFT with weights $w = \exp(r/λ)$ as uniquely suited to this setting, and provide the theoretical and empirical foundations that explain why. By optimizing directly on observed rewards without querying a learned reward model, the method is immune to reward hacking, requires no propensity scores, and is fully offline. We prove the first policy improvement guarantees for this setting under noisy rewards, showing that the gap scales only logarithmically with catalog size and remains informative even for large item catalogs. Crucially, we show that temperature $λ$ explicitly and quantifiably controls the robustness-improvement tradeoff, providing practitioners with a single interpretable regularization hyperparameter with theoretical grounding. Experiments on three open-source and one proprietary dataset against four baselines confirm that exponential reward weighting is simple, scalable, and consistently outperforms RLHF-based alternatives.
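Abstract (reference translation): Aligning generative recommender systems to user preferences through post-training is essential for closing the gap between next-item prediction and actual recommendation quality. Existing post-training methods are poorly suited to production-scale systems: RLHF methods reward hack because of noisy user feedback and unreliable reward models, offline RL alternatives require propensity scores that are unavailable, and online interaction is infeasible. We identify exponential reward-weighted SFT with weights $w = \exp(r/λ)$ as uniquely suited to this setting and provide the theoretical and empirical foundations that explain why. Because it optimizes directly on observed rewards without querying a learned reward model, the method is immune to reward hacking, requires no propensity scores, and is fully offline. We prove the first policy improvement guarantees for this setting under noisy rewards, showing that the gap scales only logarithmically with catalog size and remains informative even for large item catalogs. Crucially, we show that the temperature $λ$ explicitly and quantifiably controls the robustness-improvement tradeoff, giving practitioners a single interpretable regularization hyperparameter with theoretical grounding. Experiments on three open-source datasets and one proprietary dataset against four baselines confirm that exponential reward weighting is simple, scalable, and consistently outperforms RLHF-based alternatives.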
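
The weighting $w = \exp(r/λ)$ described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy rewards and log-likelihoods, and the choice to normalize the weights within a batch are all assumptions made here for illustration. The sketch only shows how the temperature $λ$ moves the objective between plain SFT (large $λ$, near-uniform weights) and a sharp focus on high-reward examples (small $λ$).

```python
import math


def exp_reward_weights(rewards, lam):
    """Return weights proportional to exp(r / lam), normalized to sum to 1.

    Batch normalization is an assumption of this sketch, not something
    stated in the abstract.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    scaled = [r / lam for r in rewards]
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels out in the normalization.
    shift = max(scaled)
    unnorm = [math.exp(s - shift) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_sft_loss(log_probs, rewards, lam):
    """Reward-weighted negative log-likelihood: -sum_i w_i * log pi(y_i | x_i)."""
    weights = exp_reward_weights(rewards, lam)
    return -sum(w * lp for w, lp in zip(weights, log_probs))


if __name__ == "__main__":
    # Hypothetical toy batch: observed rewards and model log-likelihoods.
    rewards = [1.0, 0.0, 0.5]
    log_probs = [-2.0, -1.0, -3.0]
    for lam in (0.1, 1.0, 100.0):
        weights = exp_reward_weights(rewards, lam)
        loss = weighted_sft_loss(log_probs, rewards, lam)
        print(lam, [round(w, 3) for w in weights], round(loss, 3))
```

Only observed rewards enter the weights, so no learned reward model or propensity score is queried, which matches the offline setting the abstract describes. In a real training loop the log-likelihoods would come from the generative recommender and the loss would be differentiated with respect to its parameters.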

Related papers