R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
- URL: http://arxiv.org/abs/2505.02835v1
- Date: Mon, 05 May 2025 17:59:50 GMT
- Title: R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
- Authors: Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang
- Abstract summary: Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). We propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks.
- Score: 22.167272219418845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an $8.4\%$ improvement on the VL Reward-Bench and a $14.3\%$ improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.
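The abstract frames reward modeling as a rule-based RL task and says StableReinforce refines the training loss, advantage estimation, and reward design for stability, without spelling out the implementation. The Python sketch below illustrates the general idea under stated assumptions: a pairwise judgment is scored by a verifiable rule (does the model's verdict match the human preference label?), and per-batch advantages are standardized and clipped to damp outliers. The `<answer>` verdict tag, the -1.0 format penalty, and the standardize-then-clip recipe are hypothetical illustrations, not the paper's exact StableReinforce components.

```python
import re
import numpy as np

def rule_based_reward(judgment: str, preferred: str) -> float:
    """Score one sampled judgment from the reward model under training.

    `judgment` is the generated text (long-form reasoning followed by a final
    verdict tag); `preferred` is the human preference label ("A" or "B") from
    the preference data. The tag format and penalty value are assumptions.
    """
    m = re.search(r"<answer>\s*([AB])\s*</answer>", judgment)
    if m is None:
        return -1.0                      # unparseable output: format penalty
    return 1.0 if m.group(1) == preferred else 0.0

def robust_advantages(rewards, clip_sigma=3.0):
    """Turn per-batch rewards into advantages and clip extreme values.

    Standardize-then-clip is a plausible stand-in for a refined advantage
    estimation strategy; it is not claimed to be StableReinforce's procedure.
    """
    r = np.asarray(rewards, dtype=np.float64)
    adv = (r - r.mean()) / (r.std() + 1e-8)   # per-batch standardization
    return np.clip(adv, -clip_sigma, clip_sigma)
```

Consistent with the abstract's remark on inference compute, one common way to spend extra test-time compute with such a model is to sample several judgments per comparison and aggregate them, for example by majority vote.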
Related papers
- Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner [31.033131727230277]
Large reasoning models (LRMs) have recently shown promise in solving complex math problems when optimized with Reinforcement Learning (RL). We propose a novel intrinsic signal-driven generative process evaluation mechanism operating at the thought level to address major bottlenecks in RL-based training. Experiments on 1.5B and 7B parameter LRMs demonstrate that our method achieves higher problem-solving accuracy with significantly fewer training samples than outcome-only reward baselines.
arXiv Detail & Related papers (2025-07-31T07:54:58Z)
- Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs). Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
arXiv Detail & Related papers (2025-05-29T11:40:34Z)
- Reward Reasoning Model [104.39256985858428]
Reward Reasoning Models (RRMs) are designed to execute a deliberate reasoning process before generating final rewards. We implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities. Notably, RRMs can adaptively exploit test-time compute to further improve reward accuracy.
arXiv Detail & Related papers (2025-05-20T17:58:03Z)
- RM-R1: Reward Modeling as Reasoning [81.50471199906738]
We introduce a new class of generative reward models -- Reasoning Reward Models (ReasRMs). We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art or near state-of-the-art performance among generative RMs across multiple benchmarks.
arXiv Detail & Related papers (2025-05-05T06:11:12Z)
- Adversarial Training of Reward Models [74.17196154247964]
We introduce Adv-RM, a novel adversarial training framework that automatically identifies adversarial examples. By leveraging reinforcement learning, Adv-RM trains a policy to expose vulnerabilities in large state-of-the-art reward models. We demonstrate that Adv-RM significantly outperforms conventional reward training.
arXiv Detail & Related papers (2025-04-08T15:38:25Z)
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [54.4392552373835]
Reward models (RMs) are crucial for the training and inference-time scaling of large language models (LLMs). We propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals to provide reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference-time best-of-n searches on real-world downstream tasks.
arXiv Detail & Related papers (2025-02-26T17:19:12Z)
- Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs [58.18140409409302]
Large Language Models (LLMs) have made substantial strides in structured tasks through Reinforcement Learning (RL). Applying RL in broader domains like chatbots and content generation presents unique challenges. We present a case study of reproducing existing reward model ensemble research using embedding-based reward models.
arXiv Detail & Related papers (2025-02-04T19:37:35Z)
- On Designing Effective RL Reward at Training Time for LLM Reasoning [14.006845442313134]
We evaluate popular reward models for RL training, including the Outcome-supervised Reward Model (ORM) and the Process-supervised Reward Model (PRM). Surprisingly, even though these learned reward models have strong inference-time performance, they may not help and can even hurt RL training. We introduce two novel reward refinement techniques, Clipping and Delta.
arXiv Detail & Related papers (2024-10-19T13:53:50Z)
- WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions.
arXiv Detail & Related papers (2024-01-22T18:27:08Z)
- Let's Reinforce Step by Step [10.65244642965387]
We use Reinforcement Learning from Human Feedback to shape model reasoning processes.
Our results show that the fine-grained reward provided by PRM-based methods enhances accuracy on simple mathematical reasoning.
We also show the critical role reward aggregation functions play in model performance.
arXiv Detail & Related papers (2023-11-10T01:35:51Z)