Fugu-MT paper translation (abstract): IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning

Paper overview: IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning

arxiv url: http://arxiv.org/abs/2601.00677v1
Date: Fri, 02 Jan 2026 12:57:06 GMT
Status: translation complete
System update date: 2026-01-05 15:04:33.571032
Title: IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning
Authors: Haonan Song, Qingchen Xie, Huan Zhu, Feng Xiao, Luxi Xing, Fuzhen Li, Liu Kang, Feng Jiang, Zhiyong Zheng, Fan Yang
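Abstract summary: Intergroup Relative Preference Optimization (IRPO) is a new RL framework that incorporates the well-established Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO can efficiently evaluate arbitrarily many candidates during RL training. Experimental results show that IRPO achieves state-of-the-art (SOTA) performance among pointwise GRMs.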
Reference score (independently computed attention score): 11.499402258204375
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generative Reward Models (GRMs) have attracted considerable research interest in reward modeling due to their interpretability, inference-time scalability, and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck when integrated with RL algorithms such as Group Relative Policy Optimization (GRPO). This bottleneck arises from two factors: (i) the O(n^2) time complexity of pairwise comparisons required to obtain relative scores, and (ii) the computational overhead of repeated sampling or additional chain-of-thought (CoT) reasoning to improve performance. To address the first factor, we propose Intergroup Relative Preference Optimization (IRPO), a novel RL framework that incorporates the well-established Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO enables efficient evaluation of arbitrarily many candidates during RL training while preserving interpretability and fine-grained reward signals. Experimental results demonstrate that IRPO achieves state-of-the-art (SOTA) performance among pointwise GRMs across multiple benchmarks, with performance comparable to that of current leading pairwise GRMs. Furthermore, we show that IRPO significantly outperforms pairwise GRMs in post-training evaluations.
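The abstract contrasts pairwise comparison, which needs O(n^2) judgments for n candidates, with pointwise scoring under the Bradley-Terry model. The sketch below illustrates only the standard Bradley-Terry relation, where the probability that response i is preferred over response j is the logistic function of the score difference. It is not the paper's training procedure; the function names and the score values are hypothetical and chosen for illustration.

```python
import math
from itertools import combinations


def bradley_terry_prob(score_i: float, score_j: float) -> float:
    """P(i preferred over j) under Bradley-Terry with strengths exp(score)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def pairwise_comparison_count(n: int) -> int:
    """Number of unordered pairs a pairwise judge would have to compare."""
    return n * (n - 1) // 2


# Hypothetical pointwise scores for four candidate responses (made-up values).
scores = [2.0, 0.5, 0.5, -1.0]

# One score per response (n scoring calls) is enough to derive
# a preference probability for every one of the n(n-1)/2 pairs.
prefs = {
    (i, j): bradley_terry_prob(scores[i], scores[j])
    for i, j in combinations(range(len(scores)), 2)
}

for (i, j), p in prefs.items():
    print(f"P(response {i} > response {j}) = {p:.3f}")

print("scoring calls:", len(scores))
print("pairwise comparisons avoided:", pairwise_comparison_count(len(scores)))
```

With n candidates, a pointwise scorer is called n times, while a pairwise judge would need n(n-1)/2 comparisons; the gap grows quadratically as the group size increases.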

Related papers