Related papers: PersRM-R1: Enhance Personalized Reward Modeling with Reinforcement Learning

URL: http://arxiv.org/abs/2508.14076v1
Date: Tue, 12 Aug 2025 14:25:58 GMT
Title: PersRM-R1: Enhance Personalized Reward Modeling with Reinforcement Learning
Authors: Mengdi Li, Guanqiao Chen, Xufeng Zhao, Haochen Wen, Shu Yang, Di Wang
Abstract summary: PersRM-R1 is the first reasoning-based reward modeling framework specifically designed to identify and represent personal factors from only one or a few personal exemplars. Our approach combines synthetic data generation with a two-stage training pipeline consisting of supervised fine-tuning followed by reinforcement fine-tuning. Experimental results demonstrate that PersRM-R1 outperforms existing models of similar size and matches the performance of much larger models in both accuracy and generalizability.
Score: 7.899605480166484
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reward models (RMs), which are central to existing post-training methods, aim to align LLM outputs with human values by providing feedback signals during fine-tuning. However, existing RMs struggle to capture nuanced, user-specific preferences, especially under limited data and across diverse domains. Thus, we introduce PersRM-R1, the first reasoning-based reward modeling framework specifically designed to identify and represent personal factors from only one or a few personal exemplars. To address challenges including limited data availability and the requirement for robust generalization, our approach combines synthetic data generation with a two-stage training pipeline consisting of supervised fine-tuning followed by reinforcement fine-tuning. Experimental results demonstrate that PersRM-R1 outperforms existing models of similar size and matches the performance of much larger models in both accuracy and generalizability, paving the way for more effective personalized LLMs.

Related papers

SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. We propose SoliReward, a systematic framework for video RM training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z)
CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling [60.55856973678002]
Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning. Existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs. We propose CALM, a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks.
arXiv Detail & Related papers (2025-10-05T13:38:31Z)
User-centric Subjective Leaderboard by Customizable Reward Modeling [34.40455169451943]
We present the first User-Centric Subjective Leaderboard (USL). It provides a preference-driven, dynamic ranking of large language models (LLMs) across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries.
arXiv Detail & Related papers (2025-08-13T03:39:04Z)
Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment [35.80989342492335]
Noisy preferences in human feedback can lead to reward misgeneralization. This paper aims to identify how noisy preferences differ from human-aligned preferences in reward modeling. We propose an online Collaborative Reward Modeling framework to achieve robust preference learning.
arXiv Detail & Related papers (2025-05-15T10:58:20Z)
RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z)
Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling [87.17041933863041]
Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs). We introduce a Response-conditioned Bradley-Terry (Rc-BT) model that enhances the model's capability in mitigating length bias and following length instructions. We also propose the Rc-RM and Rc-DPO algorithms to leverage the Rc-BT model for reward modeling and direct policy optimization.
arXiv Detail & Related papers (2025-02-02T14:50:25Z)
Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z)
Semi-Supervised Reward Modeling via Iterative Self-Training [52.48668920483908]
We propose Semi-Supervised Reward Modeling (SSRM), an approach that enhances RM training using unlabeled data. We demonstrate that SSRM significantly improves reward models without incurring additional labeling costs. Overall, SSRM substantially reduces the dependency on large volumes of human-annotated data, thereby decreasing the overall cost and time involved in training effective reward models.
arXiv Detail & Related papers (2024-09-10T22:57:58Z)
SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision. We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.