IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning
- URL: http://arxiv.org/abs/2601.00677v1
- Date: Fri, 02 Jan 2026 12:57:06 GMT
- Title: IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning
- Authors: Haonan Song, Qingchen Xie, Huan Zhu, Feng Xiao, Luxi Xing, Fuzhen Li, Liu Kang, Feng Jiang, Zhiyong Zheng, Fan Yang
- Abstract summary: Intergroup Relative Preference Optimization (IRPO) is a novel RL framework that incorporates the well-established Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO enables efficient evaluation of arbitrarily many candidates during RL training. Experimental results demonstrate that IRPO achieves state-of-the-art (SOTA) performance among pointwise GRMs.
- Score: 11.499402258204375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative Reward Models (GRMs) have attracted considerable research interest in reward modeling due to their interpretability, inference-time scalability, and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck when integrated with RL algorithms such as Group Relative Policy Optimization (GRPO). This bottleneck arises from two factors: (i) the O(n^2) time complexity of pairwise comparisons required to obtain relative scores, and (ii) the computational overhead of repeated sampling or additional chain-of-thought (CoT) reasoning to improve performance. To address the first factor, we propose Intergroup Relative Preference Optimization (IRPO), a novel RL framework that incorporates the well-established Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO enables efficient evaluation of arbitrarily many candidates during RL training while preserving interpretability and fine-grained reward signals. Experimental results demonstrate that IRPO achieves state-of-the-art (SOTA) performance among pointwise GRMs across multiple benchmarks, with performance comparable to that of current leading pairwise GRMs. Furthermore, we show that IRPO significantly outperforms pairwise GRMs in post-training evaluations.
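To make the complexity contrast concrete, here is a minimal sketch of how pointwise Bradley-Terry scores could drive GRPO-style group advantages. The function names and the exact normalization are illustrative assumptions, not the paper's implementation.

```python
import math

def bt_preference(score_i: float, score_j: float) -> float:
    """Bradley-Terry probability that response i beats response j."""
    return 1.0 / (1.0 + math.exp(score_j - score_i))

def group_relative_advantages(scores: list[float]) -> list[float]:
    """GRPO-style advantage: standardize the pointwise scores within a group.
    One GRM call per response -> O(n) evaluations instead of O(n^2) pairs."""
    n = len(scores)
    mean = sum(scores) / n
    std = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5 or 1.0  # guard zero variance
    return [(s - mean) / std for s in scores]

# Example: four candidate responses, each scored once by a pointwise GRM.
scores = [1.2, -0.3, 0.8, 2.1]
print(group_relative_advantages(scores))
print(bt_preference(scores[0], scores[1]))  # implied pairwise preference, no extra model call
```

The key point of the sketch: once each response carries a pointwise score, any pairwise preference is implied by the Bradley-Terry model for free, so the group size can grow without quadratic judging cost.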
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
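As a rough illustration of the two-stage, draft-conditioned rollout this summary describes, the sketch below generates a draft and then re-prompts the model with it; `generate`, the prompt template, and the toy policy are all hypothetical stand-ins, not the paper's code.

```python
# Minimal sketch of a two-stage, self-conditioned rollout in the spirit of iGRPO.
def generate(policy, prompt: str) -> str:
    return policy(prompt)  # placeholder for sampling from an LLM

def igrpo_rollout(policy, prompt: str) -> str:
    draft = generate(policy, prompt)                                 # stage 1: free-form draft
    conditioned = f"{prompt}\n\nDraft answer:\n{draft}\n\nRevise:"   # stage 2: self-conditioning
    return generate(policy, conditioned)                             # refined response scored by GRPO

toy_policy = lambda p: p.upper()[:40]  # trivial stand-in so the sketch runs
print(igrpo_rollout(toy_policy, "Solve 2+2."))
```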
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
- From Absolute to Relative: Rethinking Reward Shaping in Group-Based Reinforcement Learning [7.6602542594279335]
We propose Reinforcement Learning with Relative Rewards (RLRR) to shift reward shaping from absolute scoring to relative ranking. We show that RLRR yields consistent performance improvements over standard group-based baselines across reasoning benchmarks and open-ended generation tasks.
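The abstract does not give RLRR's exact formula, so as one hedged illustration of "absolute scoring to relative ranking", this sketch replaces raw rewards with centered ranks inside a group:

```python
def relative_rank_rewards(rewards: list[float]) -> list[float]:
    """Map each reward to its normalized rank in [-1, 1] within the group."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    n = len(rewards)
    ranks = [0.0] * n
    for rank, idx in enumerate(order):
        ranks[idx] = (2.0 * rank / (n - 1) - 1.0) if n > 1 else 0.0
    return ranks

# The absolute scale of the rewards no longer matters, only their ordering.
print(relative_rank_rewards([0.1, 5.0, 2.0, 1.9]))  # [-1.0, 1.0, 0.33..., -0.33...]
```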
arXiv Detail & Related papers (2026-01-30T15:07:06Z)
- TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization [32.17940023097263]
Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. Current reinforcement learning (RL) frameworks for search-augmented reasoning rely on sparse outcome-level rewards. We propose Turn-level Stage-aware Policy Optimization (TSPO) to address this problem.
arXiv Detail & Related papers (2026-01-30T09:58:45Z)
- ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking [84.07076200941474]
ArenaRL is a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. We construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Experiments show that ArenaRL substantially outperforms standard RL baselines.
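For intuition, a tournament-style intra-group ranking can be sketched as a round-robin where win counts become advantage signals. The toy judge and the centering scheme below are our illustrative assumptions; ArenaRL's actual tournament design may differ.

```python
from itertools import combinations

def tournament_advantages(responses: list[str], judge) -> list[float]:
    """Round-robin every pair in the group; centered win counts act as advantages."""
    wins = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        if judge(responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    mean = sum(wins) / len(responses)
    return [w - mean for w in wins]

toy_judge = lambda a, b: len(a) > len(b)  # stand-in pairwise judge: prefer longer response
print(tournament_advantages(["short", "a bit longer", "the longest answer"], toy_judge))
```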
arXiv Detail & Related papers (2026-01-10T08:43:07Z)
- ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning [17.98065634130798]
We propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO). ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt. We have discovered that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors.
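One way to read "comparing relative generation probabilities" is a softmax over the group's sequence log-probabilities, centered against uniform. The exact formula below is an illustrative guess, not the paper's definition.

```python
import math

def preference_advantages(logprobs: list[float]) -> list[float]:
    """Softmax the group's sequence log-probabilities, then center against uniform."""
    m = max(logprobs)                                  # shift for numerical stability
    exps = [math.exp(lp - m) for lp in logprobs]
    z = sum(exps)
    probs = [e / z for e in exps]
    uniform = 1.0 / len(logprobs)
    return [p - uniform for p in probs]

# Higher-confidence responses receive positive advantage, no external reward needed.
print(preference_advantages([-12.0, -10.5, -11.0]))
```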
arXiv Detail & Related papers (2025-11-26T03:10:15Z)
- Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization [20.004150645050537]
Group Turn Policy Optimization (GTPO) is a novel reinforcement learning algorithm designed for training Large Language Models (LLMs) on multi-turn Tool-Integrated Reasoning tasks. GTPO introduces three key innovations: turn-level reward assignment that provides fine-grained feedback for individual turns, return-based advantage estimation, and self-supervised reward shaping. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks.
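A hedged sketch of the first two components named above, turn-level rewards and return-based advantages: each turn gets a reward, returns-to-go are computed per trajectory, and advantages are centered across the group. The discounting and normalization choices here are ours, not GTPO's.

```python
def turn_returns(turn_rewards: list[float], gamma: float = 1.0) -> list[float]:
    """Return-to-go for each turn of one multi-turn trajectory."""
    returns, g = [], 0.0
    for r in reversed(turn_rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# Two trajectories with per-turn rewards (e.g. tool call succeeded = 1.0).
trajs = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
all_returns = [turn_returns(t) for t in trajs]
flat = [g for rs in all_returns for g in rs]
mean = sum(flat) / len(flat)
advantages = [[g - mean for g in rs] for rs in all_returns]
print(advantages)  # each turn now carries its own credit signal
```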
arXiv Detail & Related papers (2025-11-18T19:01:16Z)
- Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS [62.22644307952087]
We introduce AIRL-S, the first natural unification of RL-based and search-based TTS. We leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces. Our results show that our unified approach improves performance by 9% on average over the base model, matching GPT-4o.
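For intuition on how an AIRL discriminator can yield a dense step-level reward, here is a tiny sketch of the standard AIRL policy reward, log D - log(1 - D). Treating each reasoning step's discriminator logit as the score is our assumption, not necessarily the paper's exact construction.

```python
import math

def airl_step_reward(disc_logit: float) -> float:
    """AIRL reward r = log D - log(1 - D), with D = sigmoid(disc_logit).
    Algebraically this reduces to the raw discriminator logit."""
    d = 1.0 / (1.0 + math.exp(-disc_logit))
    return math.log(d) - math.log(1.0 - d)

print(airl_step_reward(0.7))  # a dense per-step score usable as a PRM signal
```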
arXiv Detail & Related papers (2025-08-19T23:41:15Z)
- Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or comparable to PPO and DPO.
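The regression in the title can be written compactly. Below is a hedged, scalar-only sketch of the squared-error objective as we understand it from the abstract: regress the difference of policy log-ratios between two responses onto their reward gap. The eta scaling and exact form are our reading, not verified against the paper; a real version would operate on autograd tensors.

```python
def rebel_loss(logp_new_a: float, logp_old_a: float,
               logp_new_b: float, logp_old_b: float,
               reward_a: float, reward_b: float, eta: float = 1.0) -> float:
    """Squared error between the change in log-ratios and the scaled reward gap."""
    ratio_diff = (logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)
    target = eta * (reward_a - reward_b)
    return (ratio_diff - target) ** 2

print(rebel_loss(-10.0, -10.5, -11.0, -10.8, reward_a=1.0, reward_b=0.0))
```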
arXiv Detail & Related papers (2024-04-25T17:20:45Z)