Fugu-MT paper translation (abstract): When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO

Paper overview: When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO

arxiv url: http://arxiv.org/abs/2603.13134v1
Date: Fri, 13 Mar 2026 16:25:02 GMT
Status: translation complete
Last updated in system: 2026-03-16 17:38:12.192957
Title: When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO
Title (reference translation): When Right Goes Wrong: Bilateral Context Conditioning via Inverse-Confidence Correction for GRPO
Authors: Yu Li, Tian Lan, Zhengling Qi
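Abstract summary: Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. This paper shows that the GRPO objective implicitly maximizes the margin between the policy ratios of correct and incorrect samples. It proposes Bilateral Context Conditioning (BICC), a mechanism that lets the model cross-reference successful and failed reasoning traces.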
Reference score (independently computed attention score): 18.988527161000203
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. While it computes advantages based on the group mean, GRPO treats each output as an independent sample during optimization and overlooks a vital structural signal: the natural contrast between correct and incorrect solutions within the same group. It thereby ignores the rich comparative data that could be leveraged by explicitly pitting successful reasoning traces against failed ones. To capitalize on this, we present a contrastive reformulation of GRPO, showing that the GRPO objective implicitly maximizes the margin between the policy ratios of correct and incorrect samples. Building on this insight, we propose Bilateral Context Conditioning (BICC), a mechanism that allows the model to cross-reference successful and failed reasoning traces during optimization, enabling a direct information flow across samples. We further introduce Reward-Confidence Correction (RCC), which stabilizes training by dynamically adjusting the advantage baseline in GRPO using the reward-confidence covariance derived from the first-order approximation of the variance-minimizing estimator. Both mechanisms require no additional sampling or auxiliary models and can be adapted to all GRPO variants. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements across a range of models and algorithms. Code is available at https://github.com/Skylanding/BiCC.
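Abstract (reference translation): Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. GRPO computes advantages from the group mean, but it treats each output as an independent sample during optimization and overlooks an important structural signal. To exploit this, we present a contrastive reformulation of GRPO, showing that the GRPO objective implicitly maximizes the margin between the policy ratios of correct and incorrect samples. Building on this insight, we propose Bilateral Context Conditioning (BICC), a mechanism that lets the model cross-reference successful and failed reasoning traces during optimization, enabling a direct information flow across samples. We further introduce Reward-Confidence Correction (RCC), which stabilizes training by dynamically adjusting the GRPO advantage baseline using the reward-confidence covariance derived from the first-order approximation of the variance-minimizing estimator. Neither mechanism requires additional sampling or auxiliary models, so both can be adapted to all GRPO variants. Experiments on mathematical reasoning benchmarks show consistent improvements across models and algorithms. Code is available at https://github.com/Skylanding/BiCC.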
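
The contrastive reading of GRPO stated in the abstract can be checked numerically in the simplest setting. The sketch below is not the authors' code. It assumes binary rewards (1 = correct, 0 = incorrect), the commonly used standardized group advantage (reward minus group mean, divided by group standard deviation), and it ignores clipping and KL terms. All function names are hypothetical. Under those assumptions, the unclipped surrogate equals sqrt(p(1-p)) times the gap between the mean policy ratio of correct samples and that of incorrect samples, where p is the fraction of correct samples in the group.

```python
import math


def grpo_advantages(rewards, eps=1e-8):
    """Standardized group-relative advantages: (r - mean) / (std + eps).

    Uses the population standard deviation of the group (an assumption;
    implementations differ on this detail).
    """
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]


def unclipped_surrogate(rewards, ratios):
    """Mean of advantage * policy ratio over the group (no clipping, no KL)."""
    adv = grpo_advantages(rewards)
    return sum(a * rho for a, rho in zip(adv, ratios)) / len(rewards)


def contrastive_margin(rewards, ratios):
    """sqrt(p(1-p)) * (mean ratio of correct - mean ratio of incorrect).

    Assumes binary rewards. Returns 0.0 when the group is all correct or
    all incorrect, matching the zero advantages in that case.
    """
    pos = [rho for r, rho in zip(rewards, ratios) if r == 1]
    neg = [rho for r, rho in zip(rewards, ratios) if r == 0]
    if not pos or not neg:
        return 0.0
    p = len(pos) / len(rewards)
    margin = sum(pos) / len(pos) - sum(neg) / len(neg)
    return math.sqrt(p * (1 - p)) * margin


# Illustrative group of six samples; the numbers are made up for the check.
rewards = [1, 0, 0, 1, 0, 0]
ratios = [1.2, 0.9, 1.1, 1.05, 0.8, 1.0]
surrogate = unclipped_surrogate(rewards, ratios)
margin_form = contrastive_margin(rewards, ratios)
print(abs(surrogate - margin_form) < 1e-6)
```

The two quantities agree up to the small eps in the denominator, which is why raising the surrogate amounts to widening the ratio gap between correct and incorrect samples in this simplified setting. The paper's own formulation, including BICC and RCC, goes beyond this check.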

Related papers