Fugu-MT Paper Translation (Abstract): Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning

Paper overview: Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning

arxiv url: http://arxiv.org/abs/2510.20150v1
Date: Thu, 23 Oct 2025 02:56:00 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:17.2205
Title: Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning
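Title (reference translation): Rank-GRPO: LLM-based Conversational Recommender Systems via Reinforcement Learning
Authors: Yaochen Zhu, Harald Steck, Dawen Liang, Yinhan He, Jundong Li, Nathan Kallus
Abstract summary: ConvRec-R1 is a two-stage framework for end-to-end training of conversational recommender systems. In Stage 1, the authors construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline. In Stage 2, they propose Rank-GRPO, a principled extension of group relative policy optimization.
Reference score (attention score computed independently by this site): 74.15352701508009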
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit instead of token (too fine-grained) or sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://github.com/yaochenzhu/Rank-GRPO.
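As a rough illustration of the rank-level importance ratio mentioned in the abstract, the Python sketch below computes, for each rank in a generated list, the ratio between the geometric means of the token probabilities under the current policy and under the old (sampling) policy. The function name `rank_importance_ratios`, the `rank_spans` segmentation of tokens into ranks, and the toy numbers are assumptions made for illustration only; they are not taken from the paper or its released code, and the paper's exact formulation (for example any clipping) may differ.

```python
import math


def rank_importance_ratios(new_logprobs, old_logprobs, rank_spans):
    """Return one importance ratio per rank (illustrative sketch).

    new_logprobs / old_logprobs: per-token log-probabilities of the same
        sampled response under the current and the old policy.
    rank_spans: list of (start, end) token index pairs, end exclusive,
        one pair per recommended item (hypothetical segmentation).
    """
    ratios = []
    for start, end in rank_spans:
        n_tokens = end - start
        if n_tokens <= 0:
            raise ValueError("each rank must cover at least one token")
        # The geometric mean of token probabilities equals
        # exp(mean of token log-probabilities), so the ratio of two
        # geometric means is exp(mean of the log-probability differences).
        mean_diff = sum(
            new_logprobs[i] - old_logprobs[i] for i in range(start, end)
        ) / n_tokens
        ratios.append(math.exp(mean_diff))
    return ratios


# Toy example: rank 1 spans two tokens, rank 2 spans one token.
new_lp = [math.log(0.5), math.log(0.5), math.log(0.2)]
old_lp = [math.log(0.25), math.log(0.25), math.log(0.2)]
ratios = rank_importance_ratios(new_lp, old_lp, [(0, 2), (2, 3)])
# ratios is approximately [2.0, 1.0]
```

Averaging in log space keeps the ratio from shrinking or exploding with the number of tokens in an item title, which is one plausible reading of why a geometric mean would stabilize updates at the rank level.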
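Abstract (reference translation): Large language models (LLMs) are reshaping the recommender system paradigm by letting users express preferences and receive recommendations through conversation. However, aligning LLMs to the recommendation task is difficult: pretrained LLMs often generate out-of-catalog items, violate the required output format, and show ranking quality that drops sharply toward the end of the generated list. This work therefore proposes ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, a behavioral-cloning dataset is constructed with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start RL training. In Stage 2, the authors propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) suited to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit, rather than the token (too fine-grained) or the sequence (too coarse), redefines rewards to remove non-causal credit assignment, and introduces a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are available at https://github.com/yaochenzhu/Rank-GRPO.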
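For readers unfamiliar with the two metrics named in the abstract, the following Python sketch shows one common way to compute Recall@k and NDCG@k for a single ranked list. It assumes binary relevance and a ranked list without duplicate items; the function names and the toy data are illustrative, and the paper's exact evaluation protocol may differ.

```python
import math


def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of relevant items that appear in the top-k of the list."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_items, relevant_items, k):
    """Binary-relevance NDCG@k with a log2 position discount."""
    relevant = set(relevant_items)
    # DCG: each hit at 0-based position pos contributes 1 / log2(pos + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    # Ideal DCG: all relevant items (up to k) placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (hypothetical item names).
ranked = ["A", "B", "C", "D"]
relevant = ["B", "D", "E"]
recall = recall_at_k(ranked, relevant, k=3)  # one of three relevant items found
ndcg = ndcg_at_k(ranked, relevant, k=3)
```

Because the discount shrinks with position, NDCG rewards placing relevant items early in the list, which is why it is sensitive to the tail-of-list quality drop the abstract describes.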

Related papers