EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences
- URL: http://arxiv.org/abs/2510.06370v2
- Date: Thu, 09 Oct 2025 20:13:47 GMT
- Title: EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences
- Authors: Kshitish Ghate, Andy Liu, Devansh Jain, Taylor Sorensen, Atoosa Kasirzadeh, Aylin Caliskan, Mona T. Diab, Maarten Sap
- Abstract summary: Large language models (LLMs) are deployed globally, creating pluralistic systems that can accommodate the diverse preferences of users worldwide. EVALUESTEER is a benchmark to measure models' steerability towards users' value and stylistic preference profiles.
- Score: 43.42154744611566
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) are deployed globally, creating pluralistic systems that can accommodate the diverse preferences and values of users worldwide becomes essential. We introduce EVALUESTEER, a benchmark to measure LLMs' and reward models' (RMs) steerability towards users' value and stylistic preference profiles grounded in psychology and human-LLM interaction literature. To address the gap in existing datasets that do not support controlled evaluations of RM steering, we synthetically generated 165,888 preference pairs -- systematically varying pairs along 4 value dimensions (traditional, secular-rational, survival, and self-expression) and 4 style dimensions (verbosity, readability, confidence, and warmth). We use EVALUESTEER to evaluate whether, given a user profile and a pair of candidate value-laden and style-laden responses, LLMs and RMs are able to select the output that aligns with the user's preferences. We evaluate six open-source and proprietary LLMs and RMs under eleven systematic prompting conditions and six preference comparison scenarios. Notably, our results show that, when given the user's full profile of values and stylistic preferences, the best models achieve <75% accuracy at choosing the correct response, in contrast to >99% accuracy when only relevant style and value preferences are provided. EVALUESTEER thus highlights the limitations of current RMs at identifying and adapting to relevant user profile information, and provides a challenging testbed for developing RMs that can be steered towards diverse human values and preferences.
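As a rough illustration of the evaluation protocol described in the abstract, the sketch below computes the pairwise-selection accuracy of an arbitrary judge. This is not the authors' released code: the judge callable, the prompt template, and all field names are assumptions made for illustration.

```python
# Minimal sketch of the pairwise steerability evaluation (an assumption-laden
# illustration, not the EVALUESTEER implementation). A "judge" is any callable
# mapping a prompt string to the label "A" or "B", e.g. a wrapped LLM or RM.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class PreferencePair:
    user_profile: str  # value + stylistic preference profile shown to the model
    response_a: str
    response_b: str
    preferred: str     # gold label "A" or "B", derived from the profile

PROMPT = (
    "User profile:\n{profile}\n\n"
    "Response A:\n{a}\n\nResponse B:\n{b}\n\n"
    "Which response better matches the user's values and style? Answer A or B."
)

def steerability_accuracy(pairs: Iterable[PreferencePair],
                          judge: Callable[[str], str]) -> float:
    """Fraction of pairs where the judge picks the profile-consistent response."""
    pairs = list(pairs)
    correct = sum(
        judge(PROMPT.format(profile=p.user_profile, a=p.response_a, b=p.response_b))
        .strip().upper().startswith(p.preferred)
        for p in pairs
    )
    return correct / len(pairs)
```

In this framing, swapping the full profile for only the relevant value and style dimensions in the prompt would reproduce the two conditions contrasted above (<75% vs. >99% accuracy).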
Related papers
- Towards Realistic Personalization: Evaluating Long-Horizon Preference Following in Personalized User-LLM Interactions [50.70965714314064]
Large Language Models (LLMs) are increasingly serving as personal assistants, where users share complex and diverse preferences over extended interactions. This work proposes RealPref, a benchmark for evaluating realistic preference-following in personalized user-LLM interactions.
arXiv Detail & Related papers (2026-03-04T15:42:43Z) - MVPBench: A Benchmark and Fine-Tuning Framework for Aligning Large Language Models with Diverse Human Values [23.048484118383616]
We introduce MVPBench, a novel benchmark that evaluates large language models' alignment with human values across 75 countries. Using MVPBench, we conduct an in-depth analysis of several state-of-the-art LLMs, revealing substantial disparities in alignment performance across geographic and demographic lines. Our findings underscore the necessity for population-aware alignment evaluation and provide actionable insights for building culturally adaptive and value-sensitive LLMs.
arXiv Detail & Related papers (2025-09-09T09:25:08Z) - User-centric Subjective Leaderboard by Customizable Reward Modeling [34.40455169451943]
We present the first User-Centric Subjective Leaderboard (USL). It provides a preference-driven, dynamic ranking of large language models (LLMs) across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries.
arXiv Detail & Related papers (2025-08-13T03:39:04Z) - Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression [9.624392327607833]
Large language models (LLMs) are currently aligned using techniques such as reinforcement learning from human feedback. We propose a steerable pluralistic model based on few-shot comparative regression that can adapt to individual user preferences.
arXiv Detail & Related papers (2025-08-11T22:40:31Z) - HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages [61.28284565925745]
We introduce HelpSteer3-Preference, a high-quality, human-annotated preference dataset comprising over 40,000 samples. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (10% absolute) over the previously best-reported results from existing RMs.
arXiv Detail & Related papers (2025-05-16T17:31:19Z) - FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users [111.56469697145519]
We propose Few-Shot Preference Optimization, which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. We generate over 1M synthetic personalized preferences using publicly available LLMs. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study.
arXiv Detail & Related papers (2025-02-26T17:08:46Z) - Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z) - Aligning LLMs with Individual Preferences via Interaction [51.72200436159636]
We train large language models (LLMs) that can "interact to align". We develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures. For evaluation, we establish the ALOE benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations.
arXiv Detail & Related papers (2024-10-04T17:48:29Z) - CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses [34.77031649891843]
We introduce CLAVE, a novel framework which integrates two complementary Large Language Models (LLMs).
This dual-model approach enables calibration with any value systems using 100 human-labeled samples per value type.
We present ValEval, a comprehensive dataset comprising 13k+ (text, value, label) tuples across diverse domains, covering three major value systems.
arXiv Detail & Related papers (2024-07-15T13:51:37Z) - CFaiRLLM: Consumer Fairness Evaluation in Large-Language Model Recommender System [16.84754752395103]
This work takes a critical stance on previous studies concerning fairness evaluation in Large Language Model (LLM)-based recommender systems. We introduce CFaiRLLM, an enhanced evaluation framework that not only incorporates true preference alignment but also rigorously examines intersectional fairness. To validate the efficacy of CFaiRLLM, we conducted extensive experiments using MovieLens and LastFM.
arXiv Detail & Related papers (2024-03-08T20:44:59Z) - On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes.
Our analysis reveals that the impact of diversified human preferences depends on both model size and data size.
Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them.
arXiv Detail & Related papers (2023-12-12T16:17:15Z)