Fugu-MT Paper Translation (Abstract): F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

Paper overview: F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

arxiv url: http://arxiv.org/abs/2605.12995v1
Date: Wed, 13 May 2026 04:52:33 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:27.815493
Title: F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking
Title (reference translation): F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking
Authors: Rohan Surana, Gagan Mundada, Junda Wu, Xintong Li, Yizhu Jiao, Bowen Jin, Sizhe Zhou, Tong Yu, Ritwik Sinha, Jiawei Han, Jingbo Shang, Julian McAuley
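Abstract summary: Large language models (LLMs) can generate a subset and order it within a single autoregressive pass. This flexibility introduces a new optimization challenge: the model must search the output space while receiving utility feedback only after the full ranked list is generated. This credit assignment gap makes end-to-end optimization unstable and sample-inefficient. The paper proposes a unified framework that performs both candidate generation and ranking within a single autoregressive rollout.
Reference score (independently computed attention level): 79.49893545611779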
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Traditional retrieval pipelines optimize utility through stages of candidate retrieval and reranking, where ranking operates over a predefined candidate set. Large Language Models (LLMs) broaden this into a generative process: given a candidate pool, an LLM can generate a subset and order it within a single autoregressive pass. However, this flexibility introduces a new optimization challenge: the model must search a combinatorial output space while receiving utility feedback only after the full ranked list is generated. Because this feedback is defined over the completed sequence, it cannot distinguish whether a poor result arises from failing to generate a relevant subset or from failing to rank that subset correctly. This credit assignment gap makes end-to-end optimization unstable and sample-inefficient. Existing systems often address this by separating candidate generation from ranking. However, such decoupling remains misaligned with downstream utility because ranking is limited by the candidate set it receives. To bridge this gap, we propose a unified framework that performs both within a single autoregressive rollout and optimizes them end-to-end via factorized group-relative policy optimization (F-GRPO). Our framework factorizes the policy into candidate generation and ranking while sharing a single LLM backbone, and jointly trains them with an order-invariant coverage reward and a position-aware utility reward. To address the resulting phase-specific credit assignment problem, we use separate group-relative advantages for generation and ranking within a two-phase sequence-level objective. Across sequential recommendation and multi-hop question answering benchmarks, F-GRPO improves top-ranked performance over GRPO and decoupled baselines, outperforms supervised alternatives, and remains competitive with strong zero-shot rerankers, with no architectural changes at inference time.
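Abstract (reference translation): Traditional retrieval pipelines optimize utility through stages of candidate retrieval and reranking, with ranking performed over a predefined candidate set. Large language models (LLMs) extend this into a generative process: given a candidate pool, an LLM can generate a subset and order it within a single autoregressive pass. This flexibility, however, introduces a new optimization challenge: the model must search a combinatorial output space while receiving utility feedback only after the full ranked list has been generated. Because this feedback is defined over the completed sequence, it cannot distinguish whether a poor result comes from failing to generate a relevant subset or from failing to rank that subset correctly. This credit assignment gap makes end-to-end optimization unstable and sample-inefficient. Existing systems often address the problem by separating candidate generation from ranking. Such decoupling remains misaligned with downstream utility, however, because ranking is limited by the candidate set it receives. To bridge this gap, the authors propose a unified framework that performs both within a single autoregressive rollout and optimizes them end-to-end via factorized group-relative policy optimization (F-GRPO). The framework factorizes the policy into candidate generation and ranking while sharing a single LLM backbone, and trains them jointly with an order-invariant coverage reward and a position-aware utility reward. To address the resulting phase-specific credit assignment problem, it uses separate group-relative advantages for generation and ranking within a two-phase sequence-level objective. Across sequential recommendation and multi-hop question answering benchmarks, F-GRPO improves top-ranked performance over GRPO and decoupled baselines, outperforms supervised alternatives, and remains competitive with strong zero-shot rerankers, with no architectural changes at inference time.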
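
The phase-specific credit assignment described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define the rewards, so the sketch assumes recall as the order-invariant coverage reward, binary-relevance NDCG as the position-aware utility reward, and the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation). All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def coverage_reward(generated, relevant):
    """Order-invariant reward (assumed here to be recall of relevant items)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(generated) & relevant) / len(relevant)


def utility_reward(ranked, relevant):
    """Position-aware reward (assumed here to be NDCG with binary relevance)."""
    relevant = set(relevant)
    ranked = list(dict.fromkeys(ranked))  # drop repeated items, keep order
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked) if item in relevant)
    ideal_hits = min(len(relevant), len(ranked))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def group_relative(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def factorized_advantages(rollouts, relevant):
    """Compute one advantage per phase for each rollout.

    rollouts: list of (generated_subset, ranked_list) pairs sampled for
    a single prompt. Returns (generation_advantages, ranking_advantages).
    """
    cov = [coverage_reward(subset, relevant) for subset, _ in rollouts]
    util = [utility_reward(ranking, relevant) for _, ranking in rollouts]
    return group_relative(cov), group_relative(util)


# Three hypothetical rollouts for one prompt; only item "a" is relevant.
rollouts = [
    (["a", "b", "c"], ["a", "b", "c"]),  # good subset, good order
    (["a", "b", "c"], ["c", "b", "a"]),  # good subset, poor order
    (["c", "d"], ["c", "d"]),            # relevant item never generated
]
gen_adv, rank_adv = factorized_advantages(rollouts, {"a"})
```

In this example the first two rollouts receive the same generation advantage, because they produce the same subset, but different ranking advantages, because they order it differently. Under the assumptions above, a two-phase objective would weight the generation-phase tokens by `gen_adv` and the ranking-phase tokens by `rank_adv`; a single sequence-level reward could not make that distinction.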

Related papers