URPO: A Unified Reward & Policy Optimization Framework for Large Language Models
- URL: http://arxiv.org/abs/2507.17515v1
- Date: Wed, 23 Jul 2025 13:52:27 GMT
- Title: URPO: A Unified Reward & Policy Optimization Framework for Large Language Models
- Authors: Songshuo Lu, Hua Wang, Zhi Chen, Yaohua Tang
- Abstract summary: We propose a novel framework that unifies instruction-following ("player") and reward modeling ("referee") within a single model and a single training phase. Our method recasts all alignment data, including preference pairs, verifiable reasoning, and open-ended instructions, into a unified generative format. Experiments on the Qwen2.5-7B model demonstrate URPO's superiority.
- Score: 10.511836918064724
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large-scale alignment pipelines typically pair a policy model with a separately trained reward model whose parameters remain frozen during reinforcement learning (RL). This separation creates a complex, resource-intensive pipeline and suffers from a performance ceiling due to a static reward signal. We propose a novel framework, Unified Reward & Policy Optimization (URPO), that unifies instruction-following ("player") and reward modeling ("referee") within a single model and a single training phase. Our method recasts all alignment data, including preference pairs, verifiable reasoning, and open-ended instructions, into a unified generative format optimized by a single Group-Relative Policy Optimization (GRPO) loop. This enables the model to learn from ground-truth preferences and verifiable logic while simultaneously generating its own rewards for open-ended tasks. Experiments on the Qwen2.5-7B model demonstrate URPO's superiority. Our unified model significantly outperforms a strong baseline using a separate generative reward model, boosting the instruction-following score on AlpacaEval from 42.24 to 44.84 and the composite reasoning average from 32.66 to 35.66. Furthermore, URPO cultivates a superior internal evaluator as a byproduct of training, achieving a RewardBench score of 85.15 and surpassing the dedicated reward model it replaces (83.55). By eliminating the need for a separate reward model and fostering a co-evolutionary dynamic between generation and evaluation, URPO presents a simpler, more efficient, and more effective path towards robustly aligned language models.
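To make the abstract's setup concrete, here is a minimal sketch of how a single model can serve as both player and referee inside a GRPO-style loop: verifiable prompts are scored by a rule-based checker, open-ended prompts by the policy itself acting as judge, and each group of rollouts is normalized into group-relative advantages. Interfaces such as `policy.sample`, `policy.judge`, and `verify_answer` are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of URPO's single-model setup: the same policy generates
# responses ("player") and scores open-ended ones ("referee"), and a
# GRPO-style group-relative advantage is computed per prompt.
# Function and variable names below are illustrative, not from the paper's code.

from dataclasses import dataclass
from typing import Callable, List
import statistics


@dataclass
class Prompt:
    text: str
    task_type: str          # "verifiable" (has a checkable answer) or "open_ended"
    reference: str = ""     # ground-truth answer for verifiable tasks


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: normalize each reward against its own rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]


def score_group(prompt: Prompt,
                responses: List[str],
                verify: Callable[[str, str], float],
                self_judge: Callable[[str, List[str]], List[float]]) -> List[float]:
    """Route each rollout group to the appropriate reward source.

    - verifiable tasks: a rule-based checker against the reference answer
    - open-ended tasks: the policy itself, prompted as a judge over the group
    """
    if prompt.task_type == "verifiable":
        return [verify(resp, prompt.reference) for resp in responses]
    return self_judge(prompt.text, responses)  # model scores its own rollouts


# Usage sketch: one GRPO step over a batch of mixed alignment data.
# `policy.sample`, `verify_answer`, `policy.judge`, and `grpo_policy_loss`
# are assumed interfaces.
#
# for prompt in batch:
#     responses = policy.sample(prompt.text, num_samples=G)
#     rewards = score_group(prompt, responses, verify_answer, policy.judge)
#     advantages = group_relative_advantages(rewards)
#     loss += grpo_policy_loss(policy, prompt.text, responses, advantages)
```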
Related papers
- Activation Reward Models for Few-Shot Model Alignment [77.37511364793515]
We introduce Activation Reward Models (Activation RMs). Activation RMs leverage activation steering to construct well-aligned reward signals using minimal supervision and no additional model finetuning. We demonstrate the effectiveness of Activation RMs in mitigating reward hacking behaviors, highlighting their utility for safety-critical applications.
arXiv Detail & Related papers (2025-07-02T05:10:29Z)
- A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization [18.892740849961456]
Reinforcement Learning from Human Feedback (RLHF) has emerged as an important paradigm for aligning large language models with human preferences during post-training. This paper introduces Pairwise-RL, an RLHF framework that addresses these challenges through a combination of generative reward modeling and a pairwise proximal policy optimization algorithm.
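As a rough illustration of the generative reward modeling mentioned above, the sketch below prompts an LLM judge to compare two responses and maps its verdict to a pairwise preference, averaging over both orderings to reduce position bias. The template and `generate` interface are assumptions, not Pairwise-RL's actual prompt.

```python
# Illustrative sketch of a generative pairwise reward model: an LLM judge is
# prompted to compare two responses and its verdict is parsed into a preference
# signal. The template and `generate` interface are assumptions for this sketch.

JUDGE_TEMPLATE = """You are a strict evaluator.
Question: {question}
Response A: {response_a}
Response B: {response_b}
Which response is better? Answer with exactly "A" or "B"."""


def pairwise_preference(generate, question: str, resp_a: str, resp_b: str) -> float:
    """Return 1.0 if the judge prefers A, 0.0 if it prefers B, 0.5 if unclear."""
    verdict = generate(JUDGE_TEMPLATE.format(
        question=question, response_a=resp_a, response_b=resp_b)).strip().upper()
    if verdict.startswith("A"):
        return 1.0
    if verdict.startswith("B"):
        return 0.0
    return 0.5  # ambiguous verdict: treat as a tie


def debiased_preference(generate, question: str, resp_a: str, resp_b: str) -> float:
    """Average over both orderings to reduce position bias in the judge."""
    forward = pairwise_preference(generate, question, resp_a, resp_b)
    backward = 1.0 - pairwise_preference(generate, question, resp_b, resp_a)
    return 0.5 * (forward + backward)
```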
arXiv Detail & Related papers (2025-04-07T11:34:48Z)
- Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment [51.14207112118503]
We introduce preference embedding, an approach that embeds responses into a latent space to capture preferences efficiently. We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Our method may enhance the alignment of foundation models with nuanced human values.
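One simplified way to realize preference embedding is sketched below: each response is embedded, a skew-symmetric bilinear form scores an ordered pair so that score(a, b) = -score(b, a), and a sigmoid turns the score into a preference probability. The parameterization here is an assumption and may differ from the paper's.

```python
# Sketch of a preference-embedding scorer: responses are embedded into a latent
# space and a skew-symmetric bilinear form turns a pair of embeddings into a
# preference score, so score(a, b) == -score(b, a) by construction. This is a
# simplified reading of the idea; the paper's exact parameterization may differ.

import numpy as np


def make_skew_symmetric(dim: int, seed: int = 0) -> np.ndarray:
    """Build a skew-symmetric matrix R (here randomly initialized, normally learned)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(dim, dim))
    return a - a.T  # R^T = -R


def preference_probability(emb_a: np.ndarray, emb_b: np.ndarray, r: np.ndarray) -> float:
    """P(a preferred over b) = sigmoid(emb_a^T R emb_b)."""
    score = float(emb_a @ r @ emb_b)
    return 1.0 / (1.0 + np.exp(-score))


# Because R is skew-symmetric, P(a > b) + P(b > a) == 1 automatically, while
# cyclic (non-Bradley-Terry) preferences among three responses remain expressible.
dim = 8
r = make_skew_symmetric(dim)
emb_a, emb_b = np.ones(dim), np.linspace(0.0, 1.0, dim)
print(preference_probability(emb_a, emb_b, r))
```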
arXiv Detail & Related papers (2024-10-03T04:22:55Z)
- HAF-RM: A Hybrid Alignment Framework for Reward Model Training [51.59246299566669]
We propose a hybrid alignment framework, HaF-RM, for reward model training. It offers a principled and effective approach to enhancing the performance and alignment of reward models.
arXiv Detail & Related papers (2024-07-04T23:26:56Z)
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data. We analyze this phenomenon and use distillation to get a better proxy for the true preference distribution over generation pairs. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations.
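A hedged sketch of what such distillation can look like: an ensemble ("family") of reward models defines a soft target preference over a generation pair, and the policy's DPO-style implicit reward is trained to match it with a cross-entropy loss. The names and exact loss below are assumptions rather than the paper's formulation.

```python
# One plausible reading of reward-model distillation for preference optimization:
# a family of reward models defines a soft target preference for a pair of
# generations, and the policy's implicit DPO-style reward is trained to match it.
# Names and the exact loss are assumptions for this sketch.

import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ensemble_target(rewards_a: List[float], rewards_b: List[float]) -> float:
    """Soft target P(a > b): average preference implied by each reward model."""
    prefs = [sigmoid(ra - rb) for ra, rb in zip(rewards_a, rewards_b)]
    return sum(prefs) / len(prefs)


def implicit_preference(logp_a: float, logp_b: float,
                        ref_logp_a: float, ref_logp_b: float,
                        beta: float = 0.1) -> float:
    """Policy-implied P(a > b) from DPO's implicit reward beta * log(pi / pi_ref)."""
    margin = beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))
    return sigmoid(margin)


def distillation_loss(target: float, predicted: float, eps: float = 1e-8) -> float:
    """Cross-entropy between the soft ensemble target and the policy's preference."""
    return -(target * math.log(predicted + eps)
             + (1.0 - target) * math.log(1.0 - predicted + eps))
```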
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- SimPO: Simple Preference Optimization with a Reference-Free Reward [43.136307294076545]
Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm.
We propose SimPO, a simpler yet more effective alternative to DPO.
SimPO consistently and significantly outperforms DPO without substantially increasing response length.
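For reference, a minimal sketch of the SimPO objective as described in the paper: the implicit reward is the length-normalized average log-probability of a response under the policy (no reference model), and a target margin gamma separates chosen from rejected responses. Hyperparameter values and names below are illustrative.

```python
# Minimal sketch of the SimPO objective: the implicit reward is the average
# log-probability of the response under the policy (reference-free and
# length-normalized), and the loss enforces a target margin gamma between the
# chosen and rejected responses. Variable names and defaults are illustrative.

import math


def simpo_reward(sum_logprob: float, num_tokens: int, beta: float = 2.0) -> float:
    """Length-normalized implicit reward: (beta / |y|) * log pi_theta(y | x)."""
    return beta * sum_logprob / num_tokens


def simpo_loss(chosen_sum_logprob: float, chosen_len: int,
               rejected_sum_logprob: float, rejected_len: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """-log sigmoid(r_chosen - r_rejected - gamma), with no reference model."""
    margin = (simpo_reward(chosen_sum_logprob, chosen_len, beta)
              - simpo_reward(rejected_sum_logprob, rejected_len, beta)
              - gamma)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Example: the chosen response is more probable per token, so the loss is small.
print(simpo_loss(chosen_sum_logprob=-40.0, chosen_len=20,
                 rejected_sum_logprob=-90.0, rejected_len=30))
```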
arXiv Detail & Related papers (2024-05-23T16:01:46Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
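A schematic sketch of the self-play idea: each candidate response is scored by its estimated win rate against responses drawn from the current policy, and the next iterate upweights candidates that win more than half the time. This is a simplification for illustration; SPPO's concrete loss and estimator differ in detail.

```python
# Schematic sketch of the self-play idea: candidates are scored by their
# estimated win rate against the current policy's own samples, and target
# weights exponentially favor candidates with win rate above 1/2.
# This is a simplified illustration, not the paper's exact loss.

import math
import random
from typing import Callable, List


def estimate_win_rate(candidate: str,
                      opponents: List[str],
                      prefers: Callable[[str, str], float]) -> float:
    """Monte-Carlo estimate of P(candidate beats a response from the current policy)."""
    return sum(prefers(candidate, opp) for opp in opponents) / len(opponents)


def reweighted_targets(candidates: List[str],
                       opponents: List[str],
                       prefers: Callable[[str, str], float],
                       eta: float = 1.0) -> List[float]:
    """Target weights proportional to exp(eta * (win_rate - 1/2)) per candidate."""
    weights = [math.exp(eta * (estimate_win_rate(c, opponents, prefers) - 0.5))
               for c in candidates]
    total = sum(weights)
    return [w / total for w in weights]


# Toy usage with a random preference oracle standing in for a preference model.
def random_oracle(a: str, b: str) -> float:
    return float(random.random() < 0.5)

print(reweighted_targets(["resp1", "resp2"], ["opp1", "opp2", "opp3"], random_oracle))
```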
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
- Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences [21.5605000515622]
This paper studies post-training large language models (LLMs) using preference feedback from an oracle to help a model iteratively improve over itself.
We introduce Direct Nash Optimization (DNO), a provable and efficient algorithm that marries the simplicity and stability of contrastive learning with theoretical generality from optimizing general preferences.
In our experiments, a resulting 7B parameter Orca-2.5 model achieves the state-of-the-art win-rate against GPT-4-Turbo of 33% on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% (7% to 33%) over the initializing model.
arXiv Detail & Related papers (2024-04-04T17:56:41Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
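The core of such an evaluation can be sketched as follows: a reward model is counted correct on a prompt-chosen-rejected trio when it scores the chosen response above the rejected one (the full benchmark additionally aggregates across categories, which is omitted here). The `score` interface is an assumption.

```python
# Sketch of the core RewardBench-style evaluation: for each prompt with a chosen
# and a rejected response, a reward model is correct when it scores the chosen
# response higher. The `score` interface is an assumption for this sketch.

from typing import Callable, Iterable, Tuple


def rewardbench_accuracy(trios: Iterable[Tuple[str, str, str]],
                         score: Callable[[str, str], float]) -> float:
    """Fraction of (prompt, chosen, rejected) trios ranked correctly by the reward model."""
    correct, total = 0, 0
    for prompt, chosen, rejected in trios:
        correct += score(prompt, chosen) > score(prompt, rejected)
        total += 1
    return correct / total if total else 0.0


# Usage sketch with a trivial length-based stand-in for a real reward model.
trios = [("Explain GRPO.", "A detailed, correct answer.", "idk")]
print(rewardbench_accuracy(trios, score=lambda p, r: float(len(r))))
```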
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- West-of-N: Synthetic Preferences for Self-Improving Reward Models [20.643537269666137]
We present a novel approach to improve reward model quality by generating synthetic preference data.
We find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data.
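A sketch of one common way to realize this: sample N responses per prompt, let the current reward model select the best and worst as a synthetic chosen/rejected pair, and add those pairs to the reward model's training data. The `sample` and `score` interfaces below are assumptions.

```python
# Sketch of synthetic preference generation in the West-of-N spirit: sample N
# responses per prompt, let the current reward model pick the best and the worst,
# and treat that (chosen, rejected) pair as a new training example for the
# reward model. The `sample` and `score` interfaces are assumptions.

from typing import Callable, List, Tuple


def west_of_n_pair(prompt: str,
                   sample: Callable[[str, int], List[str]],
                   score: Callable[[str, str], float],
                   n: int = 8) -> Tuple[str, str, str]:
    """Return a synthetic (prompt, chosen, rejected) trio from N sampled responses."""
    responses = sample(prompt, n)
    scored = sorted(responses, key=lambda r: score(prompt, r))
    rejected, chosen = scored[0], scored[-1]   # worst-of-N vs. best-of-N
    return prompt, chosen, rejected


# The resulting trios can be mixed with human preference data to retrain the
# reward model, which the summary above reports to be comparably effective.
```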
arXiv Detail & Related papers (2024-01-22T16:24:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.