Fugu-MT Paper Translation (Abstract): ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Paper overview: ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

arxiv url: http://arxiv.org/abs/2605.28293v1
Date: Wed, 27 May 2026 10:43:37 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:55.983271
Title: ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation
Title (reference translation): ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation
Authors: Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang, Ao Xu, Hengrui Chen, Jiaqing Liang, Deqing Yang
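Abstract summary: Proactive Recommender Systems (PRSs) aim to guide the shift of user preferences toward target items by generating paths of intermediate recommendations. The authors propose ProRL, an effective RL framework with two novel mechanisms for proactive recommendation. Experiments on three real-world datasets show that ProRL significantly outperforms state-of-the-art PRSs.
Reference score (independently computed attention score): 22.61175161826679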
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can naturally capture both short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRS results in deficient gradient estimation. We identify two deficiencies: (1) path-level rewards decompose into step-level rewards with positive mean, creating a length-dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting each step by the entire path-level reward ignores the decomposition structure, leading to high gradient variance. To rectify these two deficiencies, we propose an effective RL framework ProRL with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our code is available at https://github.com/hongruhou89/ProRL.
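Abstract (reference translation): Proactive Recommender Systems (PRSs) aim to guide the shift of user preferences toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks. However, applying policy gradients to PRSs results in deficient gradient estimation. (1) Path-level rewards decompose into step-level rewards with positive mean, which creates a length-dependent bias that makes gradients favor path extension over meaningful exploration; (2) weighting each step by the entire path-level reward ignores the decomposition structure, which leads to high gradient variance. To rectify these two deficiencies, the authors propose ProRL, an effective RL framework with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize the length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation uses the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Experiments on three real-world datasets show that ProRL significantly outperforms state-of-the-art PRSs. The code is available at https://github.com/hongruhou89/ProRL.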
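The abstract names two mechanisms but gives no formulas, so the following is only a toy sketch of the general ideas, not the authors' implementation. It assumes that the expected step reward can be estimated by the batch mean over all steps, and that a "step-dependent baseline" is the batch-mean reward-to-go at each position. The function names `center_step_rewards` and `position_specific_advantages` are hypothetical and do not come from the paper or its repository.

```python
import numpy as np


def center_step_rewards(paths, expected_step_reward=None):
    """Subtract an estimate of the expected step reward from every step.

    Assumption: when no estimate is supplied, the batch mean over all
    steps of all paths is used as the expected step reward.
    """
    arrays = [np.asarray(p, dtype=float) for p in paths]
    if expected_step_reward is None:
        expected_step_reward = np.mean(np.concatenate(arrays))
    return [a - expected_step_reward for a in arrays]


def position_specific_advantages(paths):
    """Reward-to-go at each step minus a per-position baseline.

    Assumption: the baseline at position t is the mean reward-to-go at
    position t over the paths in the batch that are long enough to have it.
    """
    # Reward-to-go: sum of rewards from step t to the end of the path.
    to_go = [np.cumsum(np.asarray(p, dtype=float)[::-1])[::-1] for p in paths]
    max_len = max(len(g) for g in to_go)

    # Pad with NaN so that shorter paths are ignored at later positions.
    padded = np.full((len(to_go), max_len), np.nan)
    for i, g in enumerate(to_go):
        padded[i, : len(g)] = g

    baseline = np.nanmean(padded, axis=0)
    return [g - baseline[: len(g)] for g in to_go]


# Two paths whose steps all earn the same positive reward: the raw path
# returns (2 and 4) favor the longer path purely because of its length.
paths = [[1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
centered = center_step_rewards(paths)
advantages = position_specific_advantages(centered)
```

In this toy batch the raw returns differ only because of path length; after centering, every step reward is zero, so extending a path adds no signal. The per-position baseline then compares each step only with other steps at the same position instead of weighting it by the whole path reward.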


Related papers