Fugu-MT Paper Translation (Abstract): Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target

Paper abstract: Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target

arxiv url: http://arxiv.org/abs/2605.18899v1
Date: Sun, 17 May 2026 11:10:44 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:08.859952
Title: Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target
Title (reference translation): Don't let bandit feedback pull continual LLM-recommender updates off target
Authors: Taesan Kim, Hyeongjun Yun, Jaegul Choo, Chung Park
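Abstract summary: Generative recommenders (LLM-Rec) require continual post-deployment updates. Deployment logs provide only policy-shaped contextual bandit feedback. The authors propose an Anchored Bandit Policy Optimization (ABPO) framework for continual LLM-Rec updates.
Reference score (independently computed attention score): 42.681980014826536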
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Generative LLM-based recommenders (LLM-Rec) require continual post-deployment updates, yet deployment logs provide only policy-shaped contextual bandit feedback: outcomes are observed solely for items exposed by a prior serving policy, inducing exposure bias and yielding partial, asymmetric signals consisting of relatively reliable positive responses and ambiguous no-responses. We propose an Anchored Bandit Policy Optimization (ABPO) framework for continual LLM-Rec updates that combines group-relative policy optimization (GRPO) with explicit treatment of exposure bias and feedback ambiguity. Specifically, we insert the exposed recommendation as a logged anchor into each GRPO rollout group, so that group-relative normalization is calibrated against the action actually exposed by the prior policy rather than against newly sampled rollouts alone. Because both positive- and no-responses are observed only through prior-policy exposure, we apply self-normalized inverse propensity scoring to the fixed anchor for both feedback types to correct for policy mismatch. At the same time, we treat the two feedback types asymmetrically in reliability: positive responses provide relatively direct endorsement signals, whereas no-responses remain ambiguous because they may reflect either true disinterest or unobserved external factors. To avoid overly aggressive updates from ambiguous no-responses, we temper their penalties with self-certainty, using the model's output-token confidence as a verifier-free reliability signal. Across five domains from Amazon Reviews and MovieLens, our method yields consistent post-update gains in recommendation accuracy while mitigating prior-policy-induced exposure bias more effectively than prior baselines.
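Abstract (reference translation): Generative LLM-based recommenders (LLM-Rec) require continual post-deployment updates, yet deployment logs provide only policy-shaped contextual bandit feedback. We propose an Anchored Bandit Policy Optimization (ABPO) framework for continual LLM-Rec updates that combines group-relative policy optimization (GRPO) with explicit treatment of exposure bias and feedback ambiguity. Specifically, we insert the exposed recommendation as a logged anchor into each GRPO rollout group, so that group-relative normalization is calibrated against the action actually exposed by the prior policy rather than against newly sampled rollouts alone. We apply self-normalized inverse propensity scoring to the fixed anchor to correct for policy mismatch in both feedback types. Positive responses provide relatively direct endorsement signals, whereas no-responses remain ambiguous because they may reflect either true disinterest or unobserved external factors. To avoid overly aggressive updates from ambiguous no-responses, we temper their penalties with self-certainty, using the model's confidence in its output as a verifier-free reliability signal. Across five domains from Amazon Reviews and MovieLens, our method yields consistent post-update gains in recommendation accuracy while mitigating prior-policy-induced exposure bias more effectively than prior baselines.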
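The abstract names three ingredients: a logged anchor placed in each GRPO rollout group, self-normalized inverse propensity scoring (SNIPS) on that anchor, and self-certainty tempering of no-response penalties. The Python sketch below is one possible reading of how these could fit together, not the authors' implementation. All function names, the +1/-1 reward scale, the geometric-mean confidence proxy, and the multiplicative tempering rule are assumptions made for illustration; the abstract does not specify them.

```python
import math
from statistics import mean, pstdev


def snips_weights(new_probs, logged_probs):
    """Self-normalized importance weights for logged (anchor) actions.

    new_probs[i]    : probability of logged action i under the policy being updated
    logged_probs[i] : probability of the same action under the prior serving policy
    The weights are rescaled so that they average to 1 over the batch.
    """
    raw = [p_new / p_old for p_new, p_old in zip(new_probs, logged_probs)]
    total = sum(raw)
    return [len(raw) * w / total for w in raw]


def self_certainty(token_probs):
    """Hypothetical confidence proxy: geometric mean of output-token probabilities."""
    return math.exp(mean(math.log(p) for p in token_probs))


def anchor_reward(positive, snips_weight, certainty):
    """Reward assigned to the logged anchor (assumed scheme).

    Positive responses get +1. No-responses get a penalty of -certainty,
    so the penalty shrinks when the model's self-certainty is low.
    """
    base = 1.0 if positive else -certainty
    return snips_weight * base


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: three sampled rollouts plus one logged anchor that received no response.
# The rollout rewards are placeholder values; how they are scored is not
# described in the abstract.
rollout_rewards = [0.2, 0.5, 0.1]
weights = snips_weights(new_probs=[0.2, 0.1], logged_probs=[0.1, 0.2])
certainty = self_certainty([0.9, 0.8, 0.7])
anchor = anchor_reward(positive=False, snips_weight=weights[0], certainty=certainty)
advantages = group_relative_advantages(rollout_rewards + [anchor])
```

Placing the anchor inside the group means its reward shifts the group mean and standard deviation, so every sampled rollout is scored relative to the action the prior policy actually exposed. That is the calibration effect the abstract describes.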


Related papers