User-centric Subjective Leaderboard by Customizable Reward Modeling
- URL: http://arxiv.org/abs/2508.09463v1
- Date: Wed, 13 Aug 2025 03:39:04 GMT
- Title: User-centric Subjective Leaderboard by Customizable Reward Modeling
- Authors: Qi Jia, Xiujie Song, Zicheng Zhang, Yijin Guo, Kaiwei Zhang, Zijian Chen, Guangtao Zhai
- Abstract summary: We present the first User-Centric Subjective Leaderboard (USL). It provides a preference-driven, dynamic ranking of large language models (LLMs) across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries.
- Score: 34.40455169451943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing benchmarks for large language models (LLMs) predominantly focus on assessing their capabilities through verifiable tasks. Such objective and static benchmarks offer limited utility for practical LLM selection, making it difficult for users to find suitable models for their individual needs. To bridge this gap, we present the first User-Centric Subjective Leaderboard (USL), which provides a preference-driven, dynamic ranking of LLMs across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries. Our investigation reveals significant diversity and contradictions in human preferences, which limit the effectiveness of state-of-the-art reward models. To address this, we introduce Customizable Reward Models (CRMs). With only 4B parameters, our CRM surpasses the performance of leading models such as GPT-4.1 and Gemini-2.5-pro, showing exceptional generalization capabilities across new topics and criteria. The USL, powered by CRMs, exhibits strong negative correlations to contradictory preferences.
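The abstract gives no implementation detail, but the core idea of a criteria-conditioned reward model can be sketched as follows; the class name `CustomizableRM` and the tiny architecture are illustrative assumptions, not the paper's 4B model.

```python
# Minimal sketch of a criteria-conditioned ("customizable") reward model.
# The user's preference criteria are encoded together with the query/response
# pair, so the same backbone can yield different rankings per user.
import torch
import torch.nn as nn

class CustomizableRM(nn.Module):  # hypothetical name, not the paper's model
    def __init__(self, vocab_size=32000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.score_head = nn.Linear(dim, 1)

    def forward(self, token_ids):
        # token_ids: concatenation of [criteria; query; response] tokens
        h = self.encoder(self.embed(token_ids))
        return self.score_head(h.mean(dim=1)).squeeze(-1)  # scalar reward

rm = CustomizableRM()
pair = torch.randint(0, 32000, (2, 128))  # two candidate responses
print(rm(pair))                           # higher score = preferred under the criteria
```

Because the criteria are part of the input, re-ranking for a new user requires no retraining, only a new criteria string.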
Related papers
- CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria [48.70940362676624]
We propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method. Using only about 5.7K high-quality examples curated from an open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks.
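A minimal sketch of pointwise generative reward scoring, the family CE-RM belongs to: the judge model writes a critique and ends with a parseable score for a single response. The prompt template and the `generate` stand-in are assumptions, not CE-RM's actual rollout procedure.

```python
# Pointwise generative scoring: instead of comparing two responses, the model
# critiques one response at a time and emits a numeric score to be parsed.
import re

POINTWISE_PROMPT = """Judge the response against the criteria below.
Criteria: {criteria}
Query: {query}
Response: {response}
End with a line of the form "Score: <1-10>"."""

def parse_score(judgment: str) -> float:
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", judgment)
    if match is None:
        raise ValueError("no score found in judgment")
    return float(match.group(1))

def pointwise_reward(generate, criteria, query, response) -> float:
    # `generate` is a stand-in for any chat LLM call
    judgment = generate(POINTWISE_PROMPT.format(
        criteria=criteria, query=query, response=response))
    return parse_score(judgment)

# Usage with a dummy generator:
print(pointwise_reward(lambda p: "Clear and concise.\nScore: 8",
                       "clarity", "Explain DNS.", "DNS maps names to IPs."))
```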
arXiv Detail & Related papers (2026-01-28T07:46:13Z)
- One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment [55.86333374784959]
We argue that addressing these constraints requires a paradigm shift: from fitting data to learn user preferences to learning the process of preference adaptation itself. We propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. We show that MRM enhances few-shot personalization, improves robustness across users, and consistently outperforms baselines.
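A minimal sketch of the "learn the adaptation process" idea, using a MAML-style loop over a linear Bradley-Terry reward; the paper's MRM objective may differ, and the synthetic per-user data here is purely illustrative.

```python
# Meta-learning a reward-model initialization. Inner loop: adapt to one
# user's few preference pairs. Outer loop: update the initialization so
# that one-step adaptation works well across users.
import torch

def bt_loss(w, chosen, rejected):
    # Bradley-Terry loss with a linear reward r(x) = w @ x on feature vectors
    return -torch.nn.functional.logsigmoid(chosen @ w - rejected @ w).mean()

dim, inner_lr, outer_lr = 8, 0.1, 0.01
w0 = torch.zeros(dim, requires_grad=True)  # meta-initialization

for step in range(100):
    users = [(torch.randn(4, dim), torch.randn(4, dim),   # support pairs
              torch.randn(4, dim), torch.randn(4, dim))   # query pairs
             for _ in range(3)]                            # synthetic users
    meta_loss = 0.0
    for s_c, s_r, q_c, q_r in users:
        grad, = torch.autograd.grad(bt_loss(w0, s_c, s_r), w0, create_graph=True)
        w_user = w0 - inner_lr * grad                      # one-step adaptation
        meta_loss = meta_loss + bt_loss(w_user, q_c, q_r)  # post-adaptation loss
    g, = torch.autograd.grad(meta_loss / len(users), w0)
    with torch.no_grad():
        w0 -= outer_lr * g
```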
arXiv Detail & Related papers (2026-01-26T17:55:52Z)
- Approximating Human Preferences Using a Multi-Judge Learned System [35.18016233072556]
We propose a framework for modeling diverse, persona-based preferences by learning to aggregate outputs from multiple rubric-conditioned judges. Our contributions include a persona-based method for synthesizing preference labels at scale and two distinct implementations of our aggregator.
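A toy sketch of the aggregation step, assuming each rubric-conditioned judge emits a scalar score and a logistic-regression aggregator learns persona-specific judge weights; the synthetic data and sklearn aggregator are assumptions, not the paper's implementation.

```python
# Learned aggregation of multiple judges: map the vector of per-judge score
# differences (response A vs. B) to a preference probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, n_judges = 200, 5
# judge_scores[i, j]: judge j's score difference between responses A and B
judge_scores = rng.normal(size=(n, n_judges))
# synthetic persona labels: 1 if A preferred (here driven by judges 0 and 2)
labels = (judge_scores[:, 0] + 0.5 * judge_scores[:, 2] > 0).astype(int)

aggregator = LogisticRegression().fit(judge_scores, labels)
print("learned judge weights:", aggregator.coef_.round(2))
print("P(A preferred):", aggregator.predict_proba(judge_scores[:1])[0, 1])
```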
arXiv Detail & Related papers (2025-10-29T18:32:53Z)
- Benchmarking and Improving LLM Robustness for Personalized Generation [42.26075952121524]
We define a model as robust if its responses are both factually accurate and aligned with the user's preferences. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned deployments.
arXiv Detail & Related papers (2025-09-18T13:56:14Z)
- PersRM-R1: Enhance Personalized Reward Modeling with Reinforcement Learning [7.899605480166484]
PersRM-R1 is the first reasoning-based reward modeling framework specifically designed to identify and represent personal factors from only one or a few personal exemplars. Our approach combines synthetic data generation with a two-stage training pipeline consisting of supervised fine-tuning followed by reinforcement fine-tuning. Experimental results demonstrate that PersRM-R1 outperforms existing models of similar size and matches the performance of much larger models in both accuracy and generalizability.
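A sketch of the verifiable reward signal typically used in this kind of reinforcement fine-tuning: the model's final verdict is checked against the gold preference label. The exact reward shaping in PersRM-R1 is not specified here, so this rule is an assumption.

```python
# Binary, verifiable reward for reinforcement fine-tuning of a reasoning
# reward model: a correct preference judgment earns reward 1, else 0.
def rft_reward(predicted_verdict: str, gold_verdict: str) -> float:
    return 1.0 if predicted_verdict.strip().upper() == gold_verdict.upper() else 0.0

print(rft_reward("a", "A"), rft_reward("B", "A"))  # 1.0 0.0
```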
arXiv Detail & Related papers (2025-08-12T14:25:58Z)
- Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment [35.80989342492335]
Noisy preferences in human feedback can lead to reward misgeneralization. This paper aims to identify how noisy preferences differ from human-aligned preferences in reward modeling. We propose an online Collaborative Reward Modeling framework to achieve robust preference learning.
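A toy sketch of one way two reward models can collaborate (a co-teaching-style cross-selection, which is an assumption about the framework's details): each model trains only on the pairs its peer is most confident about, so noisy pairs tend to be filtered out.

```python
# Two linear Bradley-Terry reward models cross-select training pairs:
# each keeps its lowest-loss (most consistent) pairs for the *other* to train on.
import numpy as np

rng = np.random.default_rng(1)
dim, n, keep = 8, 100, 60
w_a, w_b = rng.normal(size=dim), rng.normal(size=dim)
chosen, rejected = rng.normal(size=(n, dim)), rng.normal(size=(n, dim))

def pair_losses(w):
    margins = (chosen - rejected) @ w
    return np.log1p(np.exp(-margins))  # Bradley-Terry loss per pair

for step in range(10):
    sel_a = np.argsort(pair_losses(w_a))[:keep]  # model A's confident pairs
    sel_b = np.argsort(pair_losses(w_b))[:keep]  # model B's confident pairs
    # cross-update: each model takes a gradient step on its peer's selection
    for w, sel in ((w_a, sel_b), (w_b, sel_a)):
        diff = chosen[sel] - rejected[sel]
        grad = -(diff * (1 / (1 + np.exp(diff @ w)))[:, None]).mean(axis=0)
        w -= 0.1 * grad  # in-place update of w_a / w_b
```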
arXiv Detail & Related papers (2025-05-15T10:58:20Z)
- Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling [87.17041933863041]
Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs). We introduce a Response-conditioned Bradley-Terry (Rc-BT) model that enhances the model's capability in mitigating length bias and following length instructions. We also propose the Rc-RM and Rc-DPO algorithms, which leverage the Rc-BT model for reward modeling and direct policy optimization, respectively.
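For reference, the standard Bradley-Terry preference model that Rc-BT extends; the response-conditioned variant additionally conditions on length instructions, whose exact form is given in the paper.

```latex
% Probability that response y_w is preferred to y_l given prompt x,
% under a reward model r_theta:
P(y_w \succ y_l \mid x)
  = \sigma\!\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)
  = \frac{\exp r_\theta(x, y_w)}{\exp r_\theta(x, y_w) + \exp r_\theta(x, y_l)}
```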
arXiv Detail & Related papers (2025-02-02T14:50:25Z)
- Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
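A minimal sketch of neural-process-style few-shot conditioning: a handful of the user's labeled pairs is pooled into a latent preference vector that conditions the reward head. The architecture and dimensions are illustrative assumptions, not the paper's model.

```python
# Few-shot steerable reward: encode a small context set of (chosen, rejected)
# feature pairs into a permutation-invariant latent, then condition the reward.
import torch
import torch.nn as nn

class FewShotReward(nn.Module):
    def __init__(self, feat_dim=16, latent_dim=8):
        super().__init__()
        self.context_enc = nn.Linear(2 * feat_dim, latent_dim)
        self.reward_head = nn.Linear(feat_dim + latent_dim, 1)

    def forward(self, context_chosen, context_rejected, response_feats):
        # Mean over the few labeled pairs = permutation-invariant summary
        pairs = torch.cat([context_chosen, context_rejected], dim=-1)
        z = self.context_enc(pairs).mean(dim=0)           # latent preference
        z = z.expand(response_feats.size(0), -1)
        return self.reward_head(torch.cat([response_feats, z], dim=-1)).squeeze(-1)

model = FewShotReward()
ctx_c, ctx_r = torch.randn(3, 16), torch.randn(3, 16)    # 3 labeled pairs
print(model(ctx_c, ctx_r, torch.randn(2, 16)))           # rewards for 2 candidates
```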
arXiv Detail & Related papers (2024-12-18T16:14:59Z)
- Lifelong Personalized Low-Rank Adaptation of Large Language Models for Recommendation [50.837277466987345]
We focus on the field of large language models (LLMs) for recommendation.
We propose RecLoRA, which incorporates a Personalized LoRA module that maintains independent LoRAs for different users.
We also design a Few2Many Learning Strategy, using a conventional recommendation model as a lens to magnify small training spaces to full spaces.
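A minimal sketch of the personalized-LoRA idea: a frozen shared projection plus an independent low-rank adapter looked up (and lazily created) per user. Module names and the lazy-creation detail are assumptions, not RecLoRA's actual design.

```python
# Per-user LoRA on a single linear layer: shared weights stay frozen while
# each user gets an independent low-rank (A, B) adapter pair.
import torch
import torch.nn as nn

class PerUserLoRALinear(nn.Module):
    def __init__(self, dim=64, rank=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)        # shared weights stay frozen
        self.user_adapters = nn.ModuleDict()   # user_id -> low-rank adapter
        self.dim, self.rank = dim, rank

    def forward(self, x, user_id: str):
        if user_id not in self.user_adapters:  # lazily create a user's LoRA
            self.user_adapters[user_id] = nn.Sequential(
                nn.Linear(self.dim, self.rank, bias=False),  # A: down-project
                nn.Linear(self.rank, self.dim, bias=False),  # B: up-project
            )
        return self.base(x) + self.user_adapters[user_id](x)

layer = PerUserLoRALinear()
out = layer(torch.randn(2, 64), user_id="user_42")  # user-specific delta
```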
arXiv Detail & Related papers (2024-08-07T04:20:28Z)
- On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes.
Our analysis reveals that the impact of diversified human preferences depends on both model size and data size.
Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them.
arXiv Detail & Related papers (2023-12-12T16:17:15Z)
- Compositional preference models for aligning LMs [15.036426712762147]
Compositional Preference Models (CPMs) are a framework that decomposes one global preference assessment into several interpretable features.
CPMs allow controlling which properties of the preference data are used to train the preference model, and allow building it from features believed to underlie human preference judgments.
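A toy sketch of the compositional scoring this describes: per-feature assessments combined into one global preference by a simple weighted sum. The feature names and weights are invented for illustration.

```python
# Compositional preference scoring: an LM (not shown) rates each response on
# interpretable features; a linear combination yields the global preference.
FEATURES = ["helpfulness", "factuality", "conciseness", "safety"]

def cpm_score(feature_scores: dict[str, float], weights: dict[str, float]) -> float:
    # Global preference = weighted sum of per-feature assessments
    return sum(weights[f] * feature_scores[f] for f in FEATURES)

weights = {"helpfulness": 0.5, "factuality": 0.3, "conciseness": 0.1, "safety": 0.1}
response_a = {"helpfulness": 8.0, "factuality": 9.0, "conciseness": 6.0, "safety": 9.0}
response_b = {"helpfulness": 7.0, "factuality": 6.0, "conciseness": 9.0, "safety": 9.0}
preferred = "A" if cpm_score(response_a, weights) > cpm_score(response_b, weights) else "B"
print(preferred)  # "A": the factuality-weighted combination favors response A
```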
arXiv Detail & Related papers (2023-10-17T01:31:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.