Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?
- URL: http://arxiv.org/abs/2410.05584v2
- Date: Tue, 15 Oct 2024 04:50:47 GMT
- Title: Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?
- Authors: Xueru Wen, Jie Lou, Yaojie Lu, Hongyu Lin, Xing Yu, Xinyu Lu, Ben He, Xianpei Han, Debing Zhang, Le Sun,
- Abstract summary: We investigate how differences in RM accuracy translate into gaps in optimized policy performance.
We find that the way of measuring accuracy significantly impacts its ability to predict the final policy performance.
- Score: 46.396681032860414
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Reward Models (RMs) are crucial for aligning language models with human preferences. Currently, the evaluation of RMs depends on measuring accuracy against a validation set of manually annotated preference data. Although this method is straightforward and widely adopted, the relationship between RM accuracy and downstream policy performance remains under-explored. In this work, we conduct experiments in a synthetic setting to investigate how differences in RM measured by accuracy translate into gaps in optimized policy performance. Our findings reveal that while there is a weak positive correlation between accuracy and downstream performance, policies optimized towards RMs with similar accuracy can exhibit quite different performance. Moreover, we discover that the way of measuring accuracy significantly impacts its ability to predict the final policy performance. Through the lens of Regressional Goodhart's effect, we identify the existence of exogenous variables impacting the relationship between RM quality measured by accuracy and policy model capability. This underscores the inadequacy of relying solely on accuracy to reflect their impact on policy optimization.
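To make the abstract's distinction concrete, here is a toy NumPy sketch (an illustrative construction of my own, not the paper's synthetic setup): it computes a reward model's pairwise preference accuracy and the gold reward of the best-of-n policy that the RM induces. The two quantities are measured on different objects, and they need not track each other when a proxy's errors concentrate on high-reward responses, which is exactly where best-of-n selection looks.

```python
# Toy sketch (not the paper's code): pairwise RM accuracy vs. best-of-n policy return.
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_responses = 2000, 16

# Gold reward for each candidate response to each prompt.
gold = rng.normal(size=(n_prompts, n_responses))

# Two hypothetical proxy RMs: A has uniform noise, B's noise grows with the gold reward,
# so B's errors concentrate on the responses best-of-n selection cares about.
proxy_a = gold + rng.normal(scale=1.0, size=gold.shape)
proxy_b = gold + rng.normal(scale=1.0, size=gold.shape) * (1.0 + np.abs(gold))

def pairwise_accuracy(proxy):
    """Accuracy on random preference pairs drawn from the same prompt."""
    i = rng.integers(0, n_responses, size=n_prompts)
    j = (i + rng.integers(1, n_responses, size=n_prompts)) % n_responses  # j != i
    rows = np.arange(n_prompts)
    agree = (proxy[rows, i] - proxy[rows, j]) * (gold[rows, i] - gold[rows, j]) > 0
    return agree.mean()

def best_of_n_gold_reward(proxy):
    """Gold reward of the response each proxy RM would pick (best-of-n policy)."""
    picks = proxy.argmax(axis=1)
    return gold[np.arange(n_prompts), picks].mean()

for name, proxy in [("RM A", proxy_a), ("RM B", proxy_b)]:
    print(name,
          f"accuracy={pairwise_accuracy(proxy):.3f}",
          f"best-of-{n_responses} gold reward={best_of_n_gold_reward(proxy):.3f}")
```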
Related papers
- RMB: Comprehensively Benchmarking Reward Models in LLM Alignment [44.84304822376291]
Reward models (RMs) guide the alignment of large language models (LLMs), steering them toward behaviors preferred by humans.
We propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios.
Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs.
arXiv Detail & Related papers (2024-10-13T16:06:54Z)
- Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown [20.753374166695494]
We introduce the Uncertainty-aware Reward Model (URM) and its ensemble variant, URME.
URM employs a probabilistic value head to capture aleatoric uncertainty by modeling the distribution of disentangled human preference attributes.
URME further quantifies uncertainty by examining discrepancies among individual URMs within the ensemble, enabling identification of unreliable evaluations.
arXiv Detail & Related papers (2024-10-01T16:29:59Z)
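A hedged PyTorch sketch of the mechanism described in the URM/URME entry above: a value head that predicts a per-attribute mean and variance (aleatoric uncertainty), plus ensemble disagreement as a second uncertainty signal. The attribute count, hidden size, and head structure are assumptions for illustration; this is not the authors' implementation.

```python
# Sketch of a probabilistic value head and ensemble-disagreement scoring (illustrative only).
import torch
import torch.nn as nn

class ProbabilisticValueHead(nn.Module):
    def __init__(self, hidden_size: int, n_attributes: int = 5):
        super().__init__()
        self.mean = nn.Linear(hidden_size, n_attributes)
        self.log_var = nn.Linear(hidden_size, n_attributes)

    def forward(self, h: torch.Tensor):
        # h: last-token hidden state from the reward-model backbone, shape [batch, hidden].
        mu, log_var = self.mean(h), self.log_var(h)
        return mu, log_var.exp()  # per-attribute mean and aleatoric variance

def ensemble_score(heads, h):
    """Average attribute means across heads; flag inputs where the heads disagree."""
    mus = torch.stack([head(h)[0] for head in heads])   # [n_heads, batch, n_attributes]
    reward = mus.mean(dim=(0, 2))                        # scalar reward per example
    disagreement = mus.mean(dim=2).var(dim=0)            # variance of scalar rewards across heads
    return reward, disagreement

# Usage on dummy hidden states.
heads = [ProbabilisticValueHead(hidden_size=768) for _ in range(3)]
h = torch.randn(4, 768)
reward, disagreement = ensemble_score(heads, h)
```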
- SEAL: Systematic Error Analysis for Value ALignment [4.2185937778110825]
Reinforcement Learning from Human Feedback (RLHF) aims to align language models with human values.
Despite its importance, the internal mechanisms of RLHF remain poorly understood.
This paper introduces new metrics to evaluate the effectiveness of modeling and aligning human values.
arXiv Detail & Related papers (2024-08-16T18:48:30Z)
- Are We Really Achieving Better Beyond-Accuracy Performance in Next Basket Recommendation? [57.91114305844153]
Next basket recommendation (NBR) is a special type of sequential recommendation that is increasingly receiving attention.
Recent studies into NBR have found a substantial performance difference between recommending repeat items and explore items.
We propose a plug-and-play two-step repetition-exploration framework that treats repeat items and explore items separately.
arXiv Detail & Related papers (2024-05-02T09:59:35Z)
- Machine Learning Simulates Agent-Based Model Towards Policy [0.0]
We use a random forest machine learning algorithm to emulate an agent-based model (ABM) and evaluate competing policies across 46 Metropolitan Regions (MRs) in Brazil.
As a result, we obtain the optimal (and non-optimal) performance of each region over the policies.
Results suggest that MRs already have embedded structures that favor optimal or non-optimal results, but they also illustrate which policy is more beneficial to each place.
arXiv Detail & Related papers (2022-03-04T21:19:11Z)
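A minimal scikit-learn sketch of the surrogate idea from the entry above, with hypothetical region features, an arbitrary policy set, and a placeholder outcome in place of real ABM runs: fit a random forest on (region, policy) → outcome pairs, then query it to rank policies per region instead of re-running the simulation.

```python
# Illustrative surrogate sketch (made-up features and outcomes, not the paper's data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_regions, n_policies, n_features = 46, 4, 8

# One row per (region, policy) ABM run: region descriptors + one-hot policy -> outcome.
region_feats = rng.normal(size=(n_regions, n_features))
X, y = [], []
for r in range(n_regions):
    for p in range(n_policies):
        X.append(np.concatenate([region_feats[r], np.eye(n_policies)[p]]))
        y.append(rng.normal())  # placeholder for the ABM's simulated outcome
X, y = np.array(X), np.array(y)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def predicted_outcome(r: int, p: int) -> float:
    """Surrogate prediction for running policy p in region r."""
    row = np.concatenate([region_feats[r], np.eye(n_policies)[p]])[None]
    return float(forest.predict(row)[0])

# For each region, the surrogate's predicted best policy.
best_policy = [int(np.argmax([predicted_outcome(r, p) for p in range(n_policies)]))
               for r in range(n_regions)]
```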
- Robustness and Accuracy Could Be Reconcilable by (Proper) Definition [109.62614226793833]
The trade-off between robustness and accuracy has been widely studied in the adversarial literature.
We find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance.
The proposed SCORE (self-consistent robust error) facilitates the reconciliation between robustness and accuracy by definition, while still handling the worst-case uncertainty.
arXiv Detail & Related papers (2022-02-21T10:36:09Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
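A minimal sketch of the ATC recipe as summarized above: calibrate a confidence threshold on labeled source data so that the fraction of confident source examples matches source accuracy, then report the confident fraction on unlabeled target data. The confidences below are invented for illustration; the paper also considers scores other than raw softmax confidence.

```python
# Minimal ATC-style estimate of out-of-distribution accuracy (illustrative data).
import numpy as np

def learn_threshold(source_conf: np.ndarray, source_correct: np.ndarray) -> float:
    """Pick t so the fraction of source examples with confidence >= t matches source accuracy."""
    source_accuracy = source_correct.mean()
    return float(np.quantile(source_conf, 1.0 - source_accuracy))

def atc_estimate(target_conf: np.ndarray, threshold: float) -> float:
    """Predicted target accuracy: fraction of unlabeled target examples clearing the threshold."""
    return float((target_conf >= threshold).mean())

# Dummy usage with made-up confidences where correctness correlates with confidence.
rng = np.random.default_rng(0)
source_conf = rng.uniform(0.5, 1.0, size=1000)
source_correct = rng.uniform(size=1000) < source_conf
t = learn_threshold(source_conf, source_correct)

target_conf = rng.uniform(0.4, 1.0, size=1000)  # stand-in for the shifted target domain
print("estimated target accuracy:", atc_estimate(target_conf, t))
```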
- Understanding the Effects of Adversarial Personalized Ranking Optimization Method on Recommendation Quality [6.197934754799158]
We model the learning characteristics of the Bayesian Personalized Ranking (BPR) and APR optimization frameworks.
We show that APR amplifies the popularity bias more than BPR because popular (short-head) items receive a disproportionate number of positive updates.
arXiv Detail & Related papers (2021-07-29T10:22:20Z)
- Stochastic Optimization of Areas Under Precision-Recall Curves with Provable Convergence [66.83161885378192]
Areas under the ROC curve (AUROC) and the precision-recall curve (AUPRC) are common metrics for evaluating classification performance on imbalanced problems.
We propose a stochastic optimization method with provable convergence for maximizing AUPRC in deep learning.
arXiv Detail & Related papers (2021-04-18T06:22:21Z)