Fugu-MT Paper Translation (Abstract): Reinforced Preference Optimization for Recommendation

Paper summary: Reinforced Preference Optimization for Recommendation

arxiv url: http://arxiv.org/abs/2510.12211v1
Date: Tue, 14 Oct 2025 07:04:33 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.218434
Title: Reinforced Preference Optimization for Recommendation
Title (reference translation): Reinforced Preference Optimization for Recommendation
Authors: Junfei Tan, Yuxin Chen, An Zhang, Junguang Jiang, Bin Liu, Ziru Xu, Han Zhu, Jian Xu, Bo Zheng, Xiang Wang
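Abstract summary: This paper proposes Reinforced Preference Optimization for Recommendation (ReRe). ReRe incorporates constrained beam search to improve sampling efficiency and diversify hard negatives. Experiments show that ReRe consistently outperforms both traditional and LLM-based recommenders in ranking performance.
Reference score (independently computed attention score): 28.87206911186567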
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent breakthroughs in large language models (LLMs) have fundamentally shifted recommender systems from discriminative to generative paradigms, where user behavior modeling is achieved by generating target items conditioned on historical interactions. Yet current generative recommenders still suffer from two core limitations: the lack of high-quality negative modeling and the reliance on implicit rewards. Reinforcement learning with verifiable rewards (RLVR) offers a natural solution by enabling on-policy sampling of harder negatives and grounding optimization in explicit reward signals. However, applying RLVR to generative recommenders remains non-trivial. Its unique generation space often leads to invalid or repetitive items that undermine sampling efficiency, and ranking supervision is sparse since most items receive identical zero rewards. To address these challenges, we propose Reinforced Preference Optimization for Recommendation (ReRe), a reinforcement-based paradigm tailored to LLM-based recommenders, an important direction in generative recommendation. ReRe incorporates constrained beam search to improve sampling efficiency and diversify hard negatives, while augmenting rule-based accuracy rewards with auxiliary ranking rewards for finer-grained supervision. Extensive experiments on three real-world datasets demonstrate that ReRe consistently outperforms both traditional and LLM-based recommenders in ranking performance. Further analysis shows that ReRe not only enhances performance across both base and SFT-initialized models but also generalizes robustly across different backbone families and scales. Beyond empirical gains, we systematically investigate the design space of RLVR in recommendation across generation, sampling strategy, reward modeling, and optimization algorithm, offering insights for future research.
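The abstract names two mechanisms: constrained beam search, so that every sampled candidate is a valid and distinct item, and a rule-based accuracy reward augmented with an auxiliary ranking reward, so that negatives no longer all receive the same zero reward. The following is a minimal sketch of how these two pieces could fit together, not the authors' implementation. The trie-based constraint, the `END` marker, the toy `log_prob` scorer, the function names, and the exact form of the ranking reward (a penalty on each negative that shrinks with its rank, scaled by a hypothetical `aux_weight`) are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Token = str
Trie = Dict[Token, "Trie"]
END = "<eos>"  # hypothetical end-of-item marker (assumption)


def build_trie(items: Sequence[Sequence[Token]]) -> Trie:
    """Index every valid item (a token sequence) in a prefix tree."""
    root: Trie = {}
    for tokens in items:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def constrained_beam_search(
    trie: Trie,
    log_prob: Callable[[Tuple[Token, ...], Token], float],
    beam_width: int,
) -> List[Tuple[Tuple[Token, ...], float]]:
    """Beam search that only expands tokens the trie allows.

    Every finished beam is therefore a catalogue item, and because each
    trie path is unique, no item can appear twice.
    """
    beams = [((), 0.0, trie)]  # (prefix, cumulative log-prob, trie node)
    finished = []
    while beams:
        candidates = []
        for prefix, score, node in beams:
            for tok, child in node.items():
                step = score + log_prob(prefix, tok)
                if tok == END:
                    finished.append((prefix, step))
                else:
                    candidates.append((prefix + (tok,), step, child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    finished.sort(key=lambda f: f[1], reverse=True)
    return finished[:beam_width]


def accuracy_plus_ranking_rewards(
    ranked_items: Sequence[Tuple[Token, ...]],
    target: Tuple[Token, ...],
    aux_weight: float = 0.5,
) -> List[float]:
    """Rule-based accuracy reward plus an ASSUMED rank-based auxiliary term.

    The target earns 1.0. Each negative earns a penalty that is largest at
    rank 1 and decays with rank, so negatives get distinct reward values.
    """
    rewards = []
    for rank, item in enumerate(ranked_items, start=1):
        if item == target:
            rewards.append(1.0)
        else:
            rewards.append(-aux_weight / math.log2(rank + 1))
    return rewards


if __name__ == "__main__":
    catalogue = [("star", "wars"), ("star", "trek"), ("toy", "story")]
    # Toy per-token log-probabilities standing in for an LLM (assumption).
    prefs = {"star": -0.2, "toy": -1.6, "wars": -0.9,
             "trek": -0.5, "story": -0.1, END: 0.0}
    ranked = constrained_beam_search(
        build_trie(catalogue), lambda prefix, tok: prefs[tok], beam_width=3
    )
    items = [item for item, _ in ranked]
    print(items)
    print(accuracy_plus_ranking_rewards(items, target=("star", "wars")))
    # rewards -> [-0.5, 1.0, -0.25]
```

In this toy run the negative ranked first receives a larger penalty than the negative ranked third, which is one way to obtain the finer-grained supervision the abstract describes. With a real model, `log_prob` would come from the LLM's next-token distribution restricted to the trie's children at each step.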
