Fugu-MT Paper Translation (Abstract): The Peril of Preference: Why GRPO fails on Ordinal Rewards

Paper overview: The Peril of Preference: Why GRPO fails on Ordinal Rewards

arxiv url: http://arxiv.org/abs/2511.04439v1
Date: Thu, 06 Nov 2025 15:12:50 GMT
Status: Translation complete
Last updated in system: 2025-11-07 20:17:53.470129
Title: The Peril of Preference: Why GRPO fails on Ordinal Rewards
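Title (reference translation): Why GRPO Fails on Ordinal Rewards
Authors: Anisha Garg, Ganesh Venkatesh
Abstract summary: We introduce Correctness Relative Policy Optimization (CoRPO), a new formulation that fixes GRPO's tendency to positively reinforce failed trajectories under ordinal rewards. CoRPO uses an adaptive baseline that enforces a minimum quality threshold. We empirically validate CoRPO on a code verification task, where it demonstrates more stable convergence and better out-of-domain generalization.
Reference score (independently computed attention score): 0.8937905773981699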
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Group-relative Policy Optimization's (GRPO) simplicity makes it highly desirable for adapting LLMs to become experts at specific tasks. But this simplicity also makes it ill-specified as we seek to enhance RL training with richer, non-binary feedback. When using ordinal rewards to give partial credit, GRPO's simplicity starts to hurt, as its group-average baseline often assigns a positive advantage to failed trajectories and reinforces incorrect behavior. We introduce Correctness Relative Policy Optimization (CoRPO), a new formulation that solves this flaw. CoRPO uses an adaptive baseline that enforces a minimum quality threshold, ensuring failed solutions are never positively reinforced. Once the policy consistently meets this threshold, the baseline automatically transitions to a relative preference mode, pushing the model to find optimal solutions rather than just "acceptable" ones. We empirically validate CoRPO on a code verification task, where it demonstrates more stable convergence and better out-of-domain generalization. This work represents a critical step in our broader research program to enable LLMs to learn genuinely new capabilities through reinforcement learning. We achieve this by enabling LLMs to learn from rich, multi-dimensional feedback - progressing from binary to ordinal rewards in this work, and onward to denser, per-step supervision.
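Abstract (reference translation): The simplicity of Group-relative Policy Optimization (GRPO) makes it highly desirable for adapting LLMs into experts at specific tasks. However, this same simplicity leaves it ill-specified when RL training is enhanced with richer, non-binary feedback. When ordinal rewards are used to give partial credit, GRPO's simplicity starts to hurt: its group-average baseline often assigns a positive advantage to failed trajectories and so reinforces incorrect behavior. We introduce Correctness Relative Policy Optimization (CoRPO), a new formulation that solves this flaw. CoRPO uses an adaptive baseline that enforces a minimum quality threshold, ensuring that failed solutions are never positively reinforced. Once the policy consistently meets this threshold, the baseline automatically transitions to a relative preference mode, pushing the model to find optimal solutions rather than merely "acceptable" ones. We empirically validate CoRPO on a code verification task, where it demonstrates more stable convergence and better out-of-domain generalization. This work is a critical step in a broader research program to enable LLMs to learn genuinely new capabilities through reinforcement learning. We achieve this by enabling LLMs to learn from rich, multi-dimensional feedback, progressing from binary to ordinal rewards in this work and onward to denser, per-step supervision.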
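
The failure mode described in the abstract can be illustrated with a small numeric sketch. The abstract does not give CoRPO's exact formula, so the thresholded baseline below (`max(group mean, min_quality)`) is only an assumption chosen to match the stated behavior. The 0-4 reward scale, the pass threshold of 3, and the function names are likewise hypothetical. Standard-deviation normalization is omitted because it does not change the sign of an advantage.

```python
from statistics import mean


def grpo_advantages(rewards):
    """Mean-centered advantages using the group-average baseline."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def thresholded_advantages(rewards, min_quality):
    """Hypothetical CoRPO-style advantages (an assumption, not the paper's formula).

    The baseline never drops below min_quality, so a reward under the
    threshold can never receive a positive advantage. Once the group mean
    exceeds the threshold, the baseline is the group mean again.
    """
    baseline = max(mean(rewards), min_quality)
    return [r - baseline for r in rewards]


# Assumed ordinal scale 0-4, where any score below 3 counts as a failure.
weak_group = [0, 0, 1, 2]    # every trajectory failed
strong_group = [3, 3, 4, 4]  # every trajectory meets the threshold

# Group-average baseline: the failed trajectories scoring 1 and 2
# still receive positive advantages.
print(grpo_advantages(weak_group))

# Thresholded baseline: every failed trajectory gets a negative advantage.
print(thresholded_advantages(weak_group, min_quality=3))

# Above the threshold both baselines agree (relative preference mode).
print(thresholded_advantages(strong_group, min_quality=3))
```

In the weak group, the group mean is 0.75, so the plain group-average baseline rewards the trajectories that scored 1 and 2 even though both failed. With the assumed floor of 3, all four advantages are negative, and the strong group is ranked purely relative to its own mean.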

Related papers