Fugu-MT Paper Translation (Abstract): Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders

Paper overview: Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders

arxiv url: http://arxiv.org/abs/2604.22504v1
Date: Fri, 24 Apr 2026 12:31:57 GMT
Status: Translation complete
Last updated in system: 2026-04-27 15:36:26.454802
Title: Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders
Authors: Wentao Shi, Qifan Wang, Chen Chen, Fei Liu, Dongfang Liu, Xu Liu, Wanli Ma, Junfeng Pan, Linhong Zhu, Fuli Feng
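Abstract summary: Training with beam-search negatives consistently outperforms training with random negatives. The paper introduces Windowed Partial AUC (WPAUC), which constrains the false positive rate (FPR) to a window so that the objective aligns more directly with Top-$K$ metrics. Experiments on four real-world datasets validate the theory and deliver consistent state-of-the-art performance.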
Reference score (attention score computed independently by the site): 74.55181072260713
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Reinforcement learning (RL) effectively optimizes Large Language Model (LLM)-based recommenders by contrasting positive and negative items. Empirically, training with beam-search negatives consistently outperforms random negatives, yet the mechanism is not well understood. We address this gap by analyzing the induced optimization objective and show that: (i) Under binary reward feedback, optimizing LLM recommenders with Group Relative Policy Optimization (GRPO) is theoretically equivalent to maximizing the Area Under the ROC Curve (AUC), which is often misaligned with Top-$K$ recommendation; and (ii) Replacing random negatives with beam-search negatives reshapes the objective toward partial AUC, improving alignment with Top-$K$ metrics. Motivated by this perspective, we introduce Windowed Partial AUC (WPAUC), which constrains the false positive rate (FPR) to a window $[\alpha, \alpha+d]$ to more directly align with Top-$K$ metrics. We further propose an efficient Threshold-Adjusted Windowed reweighting (TAWin) RL method for its optimization, enabling explicit control over the targeted Top-$K$ performance. Experiments on four real-world datasets validate the theory and deliver consistent state-of-the-art performance.
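To make the idea of restricting AUC to an FPR window concrete, the following is a minimal sketch of an empirical windowed partial AUC for a single user. It is not taken from the paper: the abstract does not give the exact WPAUC definition or the TAWin reweighting, so the estimator below is an assumption based on the usual pairwise form of partial AUC. The function name `windowed_partial_auc`, its parameters, and the tie handling are all hypothetical. Negatives are ranked from highest to lowest score, only those whose rank fraction falls in the window [alpha, alpha + d] are kept, and the result is the fraction of (positive, windowed negative) pairs that are ordered correctly.

```python
import math
from typing import Sequence


def windowed_partial_auc(pos_scores: Sequence[float],
                         neg_scores: Sequence[float],
                         alpha: float,
                         d: float) -> float:
    """Pairwise AUC restricted to negatives in the FPR window [alpha, alpha + d].

    Assumed estimator for illustration only; not the paper's definition.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    if alpha < 0.0 or d <= 0.0 or alpha + d > 1.0:
        raise ValueError("require alpha >= 0, d > 0 and alpha + d <= 1")

    # Rank negatives from hardest (highest score) to easiest (lowest score).
    ranked = sorted(neg_scores, reverse=True)
    n = len(ranked)

    # Map the FPR window to a slice of the ranked negatives.
    # Rounding guards against float noise such as 0.30000000000000004 * 10.
    lo = math.floor(round(alpha * n, 9))
    hi = math.ceil(round((alpha + d) * n, 9))
    hi = min(max(hi, lo + 1), n)  # keep at least one negative in the window
    window = ranked[lo:hi]

    # Count correctly ordered pairs; ties count as half (an assumption).
    correct = 0.0
    for p in pos_scores:
        for q in window:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(pos_scores) * len(window))


if __name__ == "__main__":
    pos = [0.9, 0.4]
    neg = [0.8, 0.5, 0.3, 0.1]
    print(windowed_partial_auc(pos, neg, alpha=0.0, d=1.0))  # full AUC
    print(windowed_partial_auc(pos, neg, alpha=0.0, d=0.5))  # hardest half only
```

With `alpha = 0` and `d = 1` the function reduces to ordinary pairwise AUC. Shrinking the window toward the highest-scoring negatives focuses the measure on the items that compete for the top ranks, which is the intuition the abstract gives for better alignment with Top-$K$ metrics.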


Related papers