SEAL: Systematic Error Analysis for Value ALignment
- URL: http://arxiv.org/abs/2408.10270v1
- Date: Fri, 16 Aug 2024 18:48:30 GMT
- Title: SEAL: Systematic Error Analysis for Value ALignment
- Authors: Manon Revel, Matteo Cargnelutti, Tyna Eloundou, Greg Leppert
- Abstract summary: Reinforcement Learning from Human Feedback aims to align language models with human values.
Despite its importance, the internal mechanisms of RLHF remain poorly understood.
This paper introduces new metrics to evaluate the effectiveness of modeling and aligning human values.
- Score: 4.2185937778110825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) aims to align language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base LMs. Despite its importance, the internal mechanisms of RLHF remain poorly understood. This paper introduces new metrics to evaluate the effectiveness of modeling and aligning human values, namely feature imprint, alignment resistance and alignment robustness. We categorize alignment datasets into target features (desired values) and spoiler features (undesired concepts). By regressing RM scores against these features, we quantify the extent to which RMs reward them - a metric we term feature imprint. We define alignment resistance as the proportion of the preference dataset where RMs fail to match human preferences, and we assess alignment robustness by analyzing RM responses to perturbed inputs. Our experiments, utilizing open-source components like the Anthropic/hh-rlhf preference dataset and OpenAssistant RMs, reveal significant imprints of target features and a notable sensitivity to spoiler features. We observed a 26% incidence of alignment resistance in portions of the dataset where LM-labelers disagreed with human preferences. Furthermore, we find that misalignment often arises from ambiguous entries within the alignment dataset. These findings underscore the importance of scrutinizing both RMs and alignment datasets for a deeper understanding of value alignment.
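As a rough illustration of how the three metrics could be computed, here is a minimal Python sketch. It is not the authors' code: the feature names, the regression setup, and the idea of using a light perturbation for robustness are assumptions made for the example; only the definitions (regression-based feature imprint, share of pairs where the RM contradicts human preference, score shift under perturbation) follow the abstract.

```python
# Illustrative sketch (not the paper's implementation) of SEAL-style metrics,
# assuming we already have reward-model scores and binary feature annotations
# for each entry of a preference dataset.
import numpy as np
from sklearn.linear_model import LinearRegression


def feature_imprint(scores, features):
    """Regress RM scores on 0/1 feature indicators.

    scores:   (n,) array of reward-model scores.
    features: (n, k) matrix marking target features (desired values) and
              spoiler features (undesired concepts) for each response.
    Returns the regression coefficients, i.e. how much reward the RM
    attaches to each feature ("feature imprint").
    """
    reg = LinearRegression().fit(features, scores)
    return reg.coef_


def alignment_resistance(chosen_scores, rejected_scores):
    """Fraction of preference pairs where the RM fails to match humans,
    i.e. scores the human-rejected response at least as high as the
    human-chosen one."""
    chosen = np.asarray(chosen_scores)
    rejected = np.asarray(rejected_scores)
    return float(np.mean(rejected >= chosen))


def alignment_robustness(original_scores, perturbed_scores):
    """Mean absolute shift in RM score under an input perturbation
    (e.g. a light paraphrase); smaller shifts suggest a more robust RM."""
    diffs = np.asarray(perturbed_scores) - np.asarray(original_scores)
    return float(np.mean(np.abs(diffs)))


if __name__ == "__main__":
    # Toy usage with made-up numbers, purely to show the interfaces.
    rng = np.random.default_rng(0)
    feats = rng.integers(0, 2, size=(500, 3))  # e.g. [harmless, helpful, refusal] (assumed labels)
    scores = feats @ np.array([1.5, 1.0, -0.8]) + rng.normal(0, 0.3, 500)
    print("feature imprints:", feature_imprint(scores, feats))

    chosen = rng.normal(1.0, 1.0, 250)
    rejected = rng.normal(0.5, 1.0, 250)
    print("alignment resistance:", alignment_resistance(chosen, rejected))
```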
Related papers
- RMB: Comprehensively Benchmarking Reward Models in LLM Alignment [44.84304822376291]
Reward models (RMs) guide the alignment of large language models (LLMs).
We propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios.
Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs.
arXiv Detail & Related papers (2024-10-13T16:06:54Z)
- Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree? [46.396681032860414]
We investigate how differences in RM accuracy translate into gaps in optimized policy performance.
We find that the way of measuring accuracy significantly impacts its ability to predict the final policy performance.
arXiv Detail & Related papers (2024-10-08T00:52:03Z)
- Distribution Learning for Molecular Regression [10.96062816455682]
Distributional Mixture of Experts (DMoE) is a model-independent and data-independent method for regression.
We evaluate the performance of DMoE on different molecular property prediction datasets.
arXiv Detail & Related papers (2024-07-30T00:21:51Z)
- Hummer: Towards Limited Competitive Preference Dataset [19.03597445162459]
We introduce a novel metric, Alignment Dimension Conflict, to quantify the degree of conflict within preference datasets.
We present Hummer and its fine-grained variant, Hummer-F, as innovative pairwise preference datasets with reduced-conflict alignment objectives.
arXiv Detail & Related papers (2024-05-19T18:57:25Z)
- InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling [66.3072381478251]
Reward hacking, also termed reward overoptimization, remains a critical challenge.
We propose a framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective.
We show that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets.
arXiv Detail & Related papers (2024-02-14T17:49:07Z)
- Confronting Reward Model Overoptimization with Constrained RLHF [114.71591361764547]
We show that correlation between component RMs has a significant effect on the locations of the points at which overoptimization occurs.
Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers.
arXiv Detail & Related papers (2023-10-06T16:59:17Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Self-supervised Representation Learning with Relative Predictive Coding [102.93854542031396]
Relative Predictive Coding (RPC) is a new contrastive representation learning objective.
RPC maintains a good balance among training stability, minibatch size sensitivity, and downstream task performance.
We empirically verify the effectiveness of RPC on benchmark vision and speech self-supervised learning tasks.
arXiv Detail & Related papers (2021-03-21T01:04:24Z)
- AttriMeter: An Attribute-guided Metric Interpreter for Person Re-Identification [100.3112429685558]
Person ReID systems only provide a distance or similarity when matching two persons.
We propose an Attribute-guided Metric Interpreter, named AttriMeter, to semantically and quantitatively explain the results of CNN-based ReID models.
arXiv Detail & Related papers (2021-03-02T03:37:48Z)