Mutual-Taught for Co-adapting Policy and Reward Models
- URL: http://arxiv.org/abs/2506.06292v2
- Date: Tue, 10 Jun 2025 03:32:39 GMT
- Title: Mutual-Taught for Co-adapting Policy and Reward Models
- Authors: Tianyuan Shi, Canbin Huang, Fanqi Wan, Longguang Zhong, Ziyi Yang, Weizhou Shen, Xiaojun Quan, Ming Yan
- Abstract summary: We propose Mutual-Taught, a self-training method that iteratively improves both the policy model and the reward model. Experimental results demonstrate that this iterative approach leads to consistent improvements in both models.
- Score: 43.11214888109746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: During the preference optimization of large language models (LLMs), distribution shifts may arise between newly generated model samples and the data used to train the reward model (RM). This shift reduces the efficacy of the RM, which in turn negatively impacts the performance of the policy model (PM). To address this challenge, we propose Mutual-Taught, a self-training method that iteratively improves both the PM and RM without requiring additional human annotation. Our approach mirrors the expectation-maximization (EM) algorithm. In the E-step, the PM is updated using feedback from the current RM, guiding the PM toward a better approximation of the latent optimal preference distribution. In the M-step, we update the RM by constructing training data from the outputs of the PM before and after the E-step update. This process ensures that the RM adapts to the evolving policy distribution. Experimental results demonstrate that this iterative approach leads to consistent improvements in both models. Specifically, our 8B policy model, LLaMA-3-8B-Instruct-MT, achieves a length-controlled win rate of 54.1% on AlpacaEval-2, while our 8B reward model, FsfairX-LLaMA3-RM-MT, performs on par with GPT-4o-2024-08-06 on RewardBench.
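Below is a minimal Python sketch of the alternating E/M loop the abstract describes. All callables (`generate`, `score`, `dpo_update`, `rm_update`) are hypothetical stand-ins for the paper's actual training routines, and the pairing heuristics are illustrative assumptions, not the published recipe.

```python
def mutual_taught(policy, reward_model, prompts, generate, score,
                  dpo_update, rm_update, num_iterations=2, k=4):
    """One possible reading of the E/M alternation; all callables are
    caller-supplied stand-ins, not the paper's actual routines."""
    for _ in range(num_iterations):
        # E-step: rank k sampled responses per prompt with the current RM
        # and push the policy toward the preferred ones (e.g. via DPO).
        pm_pairs = []
        for x in prompts:
            candidates = [generate(policy, x) for _ in range(k)]
            candidates.sort(key=lambda y: score(reward_model, x, y))
            pm_pairs.append((x, candidates[-1], candidates[0]))  # (prompt, chosen, rejected)
        old_policy, policy = policy, dpo_update(policy, pm_pairs)

        # M-step: label post-update outputs as "chosen" and pre-update
        # outputs as "rejected", so the RM tracks the shifted policy.
        rm_pairs = [(x, generate(policy, x), generate(old_policy, x))
                    for x in prompts]
        reward_model = rm_update(reward_model, rm_pairs)
    return policy, reward_model
```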
Related papers
- Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback [52.1410307583181]
We use Reinforcement Learning from Human Feedback (RLHF) to train language models (LMs) to follow complex human preferences. As training progresses, the responses generated by the LM no longer resemble the responses seen by the reward model (RM) during training. We propose Off-Policy Corrected Reward Modeling to correct the RM using importance weighting, without requiring new labels or samples.
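A hedged sketch of what importance-weighted reward-model training can look like; the Bradley-Terry loss form, the log-probability inputs, and the weight clipping are assumptions based on the summary, not the paper's exact objective.

```python
import torch

def iw_bt_loss(r_chosen, r_rejected, logp_cur, logp_gen, clip=10.0):
    """r_*: RM scores; logp_cur/logp_gen: response log-probs under the
    current policy and the data-generating policy (assumed inputs)."""
    # Importance weights: how likely the current policy is to produce the
    # stored responses, relative to the policy that generated them.
    w = torch.exp(logp_cur - logp_gen).clamp(max=clip)
    # Standard Bradley-Terry preference loss, reweighted per example.
    bt = -torch.nn.functional.logsigmoid(r_chosen - r_rejected)
    return (w * bt).mean()
```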
arXiv Detail & Related papers (2025-07-21T11:19:04Z)
- Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model [56.92219181993453]
We propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix) to enable on-policy RFT methods such as PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) a KL-Convex policy constraint to balance the trade-off between stability and flexibility; and (3) policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady improvement.
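The sketch below shows one plausible reading of the "KL-Convex policy constraint" as a convex combination of two KL penalties, one toward the pre-update policy and one toward a reference policy; the penalty form and `alpha` are assumptions, not the paper's definition.

```python
import torch

def kl_convex_penalty(logp_new, logp_old, logp_ref, alpha=0.5):
    """Token-level log-probs under the new, old, and reference policies;
    samples are assumed to come from the new policy."""
    kl_old = (logp_new - logp_old).mean()  # sample-based KL(new || old) estimate
    kl_ref = (logp_new - logp_ref).mean()  # sample-based KL(new || ref) estimate
    # Convex mixture: alpha trades off stability (stay near old policy)
    # against flexibility (stay near reference).
    return alpha * kl_old + (1.0 - alpha) * kl_ref
```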
arXiv Detail & Related papers (2025-07-09T14:29:45Z)
- On the Robustness of Reward Models for Language Model Alignment [9.804782604188656]
We study the cause of over-optimization in reward models trained with the Bradley-Terry (BT) model. We show that the excessive dispersion of hidden-state norms is the main source of over-optimization. Applying BSR to high-quality data and models surpasses state-of-the-art RMs at the 8B scale.
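A loose, illustrative reading of the summary: a Bradley-Terry loss plus a penalty on the batch dispersion of final hidden-state norms, since that dispersion is named as the source of over-optimization. The actual form of BSR in the paper may differ.

```python
import torch

def bt_loss_with_norm_penalty(r_chosen, r_rejected, hidden, beta=0.01):
    """hidden: (batch, dim) final hidden states feeding the reward head;
    the dispersion penalty is an illustrative assumption, not BSR itself."""
    bt = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    norms = hidden.norm(dim=-1)       # per-example hidden-state norms
    dispersion = norms.var()          # spread of norms within the batch
    return bt + beta * dispersion
```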
arXiv Detail & Related papers (2025-05-12T06:48:26Z)
- The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in reasoning processes. We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with Large Language Models (LLMs). We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
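A minimal sketch of a consensus filter in the spirit of the summary: a step-level label is kept only when the MC estimate and an LLM judge agree. The per-step verdict inputs are hypothetical.

```python
def consensus_filter(steps, mc_correct, llm_correct):
    """steps: list of reasoning steps; mc_correct/llm_correct: parallel
    lists of boolean verdicts from MC estimation and an LLM critic."""
    kept = []
    for step, mc, llm in zip(steps, mc_correct, llm_correct):
        if mc == llm:                 # both sources agree on the label
            kept.append((step, mc))   # keep the step with the agreed verdict
    return kept
```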
arXiv Detail & Related papers (2025-01-13T13:10:16Z)
- Entropy-Regularized Process Reward Model [30.279394036823092]
Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning. We propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov Decision Processes (MDPs). Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models.
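One way to read "entropy-regularized" aggregation under a KL-regularized MDP is a log-mean-exp soft value over sampled rollout rewards, sketched below with a temperature `beta`; the exact training target in the paper may differ.

```python
import math

def soft_step_value(outcome_rewards, beta=1.0):
    """outcome_rewards: rewards of rollouts continuing from this step.
    Computes (1/beta) * log mean exp(beta * r), numerically stabilized."""
    n = len(outcome_rewards)
    m = max(outcome_rewards)  # shift by the max to stabilize the exponentials
    return m + (1.0 / beta) * math.log(
        sum(math.exp(beta * (r - m)) for r in outcome_rewards) / n
    )
```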
arXiv Detail & Related papers (2024-12-15T01:09:23Z)
- RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences. We introduce a causal framework that learns preferences independent of prompt-irrelevant artifacts such as response length and format. Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
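An illustrative augmentation in the spirit of the causal framing: pair each chosen response with a response to a different prompt, so that artifact-only features such as length or style cannot separate the pair. The paper's actual construction may differ.

```python
import random

def artifact_control_pairs(dataset, rng=random):
    """dataset: list of (prompt, chosen, rejected) triples."""
    augmented = []
    for i, (prompt, chosen, _) in enumerate(dataset):
        j = rng.randrange(len(dataset))
        if j == i:
            continue
        off_prompt_response = dataset[j][1]  # chosen answer to another prompt
        # The on-prompt answer should beat an off-prompt answer regardless
        # of surface artifacts like length or formatting.
        augmented.append((prompt, chosen, off_prompt_response))
    return augmented
```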
arXiv Detail & Related papers (2024-09-20T01:46:07Z)
- It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF [33.197077764166536]
Reinforcement Learning from Human Feedback involves training policy models (PMs) and reward models (RMs) to align language models with human preferences.
Instead of focusing solely on PMs and RMs independently, we propose to examine their interactions during fine-tuning.
Our study starts with observing the saturation phenomenon, where continual improvements in RM and PM do not translate into RLHF progress.
Our analysis shows that RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences, highlighting a significant discrepancy between PM and RM.
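A simple sketch of the kind of mismatch measurement cited above: the fraction of human-labeled pairs where the RM scores the rejected response at least as high as the chosen one. The score-list inputs are hypothetical.

```python
def mismatch_rate(rm_scores_chosen, rm_scores_rejected):
    """Parallel lists of RM scores for human-chosen and human-rejected
    responses; returns the fraction the RM gets wrong."""
    pairs = list(zip(rm_scores_chosen, rm_scores_rejected))
    mismatches = sum(1 for c, r in pairs if c <= r)
    return mismatches / len(pairs)
```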
arXiv Detail & Related papers (2024-06-12T07:52:17Z)
- WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions.
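Weight averaging itself is straightforward; below is a minimal sketch of averaging the parameters of several fine-tuned reward models that share one architecture. The state-dict handling is an assumption about how checkpoints are stored.

```python
import torch

def average_reward_models(state_dicts):
    """state_dicts: list of state_dicts from RMs with identical keys/shapes;
    returns a new state_dict with parameter-wise means."""
    avg = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        avg[key] = stacked.mean(dim=0)
    return avg
```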
arXiv Detail & Related papers (2024-01-22T18:27:08Z)