More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
- URL: http://arxiv.org/abs/2404.18870v2
- Date: Sat, 21 Dec 2024 22:56:04 GMT
- Title: More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
- Authors: Aaron J. Li, Satyapriya Krishna, Himabindu Lakkaraju
- Abstract summary: This study investigates how models aligned with general-purpose preference data perform across five trustworthiness verticals. Our results demonstrate that RLHF on human preferences does not automatically guarantee trustworthiness, and reverse effects are often observed. We propose to adapt efficient influence-function-based data attribution methods to the RLHF setting to better understand the influence of fine-tuning data on individual trustworthiness benchmarks.
- Score: 24.843692458375436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The trustworthiness of Large Language Models (LLMs) refers to the extent to which their outputs are reliable, safe, and ethically aligned, and it has become a crucial consideration alongside their cognitive performance. In practice, Reinforcement Learning From Human Feedback (RLHF) has been widely used to align LLMs with labeled human preferences, but its assumed effect on model trustworthiness has not been rigorously evaluated. To bridge this knowledge gap, this study investigates how models aligned with general-purpose preference data perform across five trustworthiness verticals: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy. Our results demonstrate that RLHF on human preferences does not automatically guarantee trustworthiness, and reverse effects are often observed. Furthermore, we propose to adapt efficient influence-function-based data attribution methods to the RLHF setting to better understand the influence of fine-tuning data on individual trustworthiness benchmarks, and show its feasibility by providing our estimated attribution scores. Together, our results underscore the need for more nuanced approaches for model alignment from both the data and framework perspectives, and we hope this research will guide the community towards developing language models that are increasingly capable without sacrificing trustworthiness.
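To make the data attribution idea in the abstract concrete, the sketch below scores each preference-tuning example by how well its gradient aligns with the gradient of a trustworthiness benchmark loss. This is a minimal, first-order gradient-similarity proxy rather than the paper's actual estimator; the helper names (`preference_loss`, `benchmark_loss`) and the omission of any inverse-Hessian correction are assumptions made purely for illustration.

```python
# Illustrative sketch only: first-order attribution of preference-tuning
# examples to a trustworthiness benchmark loss. The efficient influence-
# function estimator used in the paper is not reproduced here;
# `preference_loss` and `benchmark_loss` are hypothetical callables that
# return scalar losses for a PyTorch model.
import torch

def flat_grad(loss, model):
    # Flatten gradients of a scalar loss w.r.t. all trainable parameters.
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.cat([g.reshape(-1) for g in grads if g is not None])

def attribution_scores(model, train_examples, benchmark_batch,
                       preference_loss, benchmark_loss):
    # Higher score = the example's gradient points in the same direction as
    # the benchmark gradient, i.e. training on it would (to first order)
    # reduce the benchmark loss.
    g_bench = flat_grad(benchmark_loss(model, benchmark_batch), model)
    scores = []
    for example in train_examples:
        g_train = flat_grad(preference_loss(model, example), model)
        scores.append(torch.dot(g_train, g_bench).item())
    return scores
```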
Related papers
- Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models [63.116041268654705]
We find that different internal reward models within the same Large Language Models often generate inconsistent preferences.
This inconsistency raises concerns about the reliability of self-generated preference data, hinders overall alignment performance, and highlights the need for further research.
We propose Self-Consistent Internal Rewards (SCIR), a novel framework designed to enhance consistency among internal reward models during training.
arXiv Detail & Related papers (2025-02-13T03:15:31Z) - Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment [30.605500809158986]
We propose a novel causal reward modeling approach that integrates causal inference to mitigate spurious correlations.
Our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences.
arXiv Detail & Related papers (2025-01-16T16:00:37Z) - Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models [46.09562860220433]
We introduce GazeReward, a novel framework that integrates implicit feedback, specifically eye-tracking (ET) data, into the Reward Model (RM).
Our approach significantly improves the accuracy of the RM on established human preference datasets.
arXiv Detail & Related papers (2024-10-02T13:24:56Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities.
However, ensuring the safety of LLMs during the fine-tuning remains a critical concern.
We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
arXiv Detail & Related papers (2024-08-27T17:31:21Z) - Automated Trustworthiness Testing for Machine Learning Classifiers [3.3423762257383207]
This paper proposes TOWER, the first technique to automatically create trustworthiness oracles that determine whether text classifier predictions are trustworthy.
Our hypothesis is that a prediction is trustworthy if the words in its explanation are semantically related to the predicted class.
The results show that TOWER can detect a decrease in trustworthiness as noise increases, but is not effective when evaluated against the human-labeled dataset.
arXiv Detail & Related papers (2024-06-07T20:25:05Z) - Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [82.7679132059169]
Reinforcement learning from human feedback has emerged as a central tool for language model alignment.
We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO).
XPO enjoys the strongest known provable guarantees and promising empirical performance.
arXiv Detail & Related papers (2024-05-31T17:39:06Z) - Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models [14.5291643644017]
We introduce the concept of Confidence-Probability Alignment.
We probe the alignment between models' internal and expressed confidence.
Among the models analyzed, OpenAI's GPT-4 showed the strongest confidence-probability alignment.
arXiv Detail & Related papers (2024-05-25T15:42:04Z) - MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z) - Active Preference Learning for Large Language Models [12.093302163058436]
We develop an active learning strategy for DPO to make better use of preference labels.
We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model.
We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.
arXiv Detail & Related papers (2024-02-12T23:09:00Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - Aligning Large Language Models with Human Preferences through Representation Engineering [41.81020951061438]
Drawing inspiration from the emerging field of representation engineering (RepE), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an LLM.
This novel approach, denoted as Representation Alignment from Human Feedback (RAHF), proves to be effective, computationally efficient, and easy to implement.
arXiv Detail & Related papers (2023-12-26T11:01:36Z) - Learning for Counterfactual Fairness from Observational Data [62.43249746968616]
Fairness-aware machine learning aims to eliminate biases of learning models against certain subgroups described by certain protected (sensitive) attributes such as race, gender, and age.
A prerequisite for existing methods to achieve counterfactual fairness is the prior human knowledge of the causal model for the data.
In this work, we address the problem of counterfactually fair prediction from observational data without given causal models by proposing a novel framework CLAIRE.
arXiv Detail & Related papers (2023-07-17T04:08:29Z) - Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods; a minimal sketch of the DPO objective appears after this list.
arXiv Detail & Related papers (2023-05-29T17:57:46Z) - Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensures fidelity to the source data, and assesses utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z) - The Connection between Out-of-Distribution Generalization and Privacy of ML Models [11.580603875423408]
We show that a lower OOD generalization gap does not imply better robustness to MI attacks.
A model that captures stable features is more robust to MI attacks than models that exhibit better OOD generalization but do not learn stable features.
For the same provable differential privacy guarantees, a model that learns stable features provides higher utility as compared to others.
arXiv Detail & Related papers (2021-10-07T12:05:25Z) - Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition [25.595147432155642]
This paper proposes two approaches to improve the model-based confidence estimators on out-of-domain data.
Experiments show that the proposed methods can significantly improve the confidence metrics on TED-LIUM and Switchboard datasets.
arXiv Detail & Related papers (2021-10-07T10:44:27Z) - Decomposed Adversarial Learned Inference [118.27187231452852]
We propose a novel approach, Decomposed Adversarial Learned Inference (DALI).
DALI explicitly matches prior and conditional distributions in both data and code spaces.
We validate the effectiveness of DALI on the MNIST, CIFAR-10, and CelebA datasets.
arXiv Detail & Related papers (2020-04-21T20:00:35Z)
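As noted in the Direct Preference Optimization entry above, here is a minimal sketch of the DPO objective. It assumes sequence log-probabilities have already been computed for the policy and a frozen reference model; the variable names and the default `beta` are illustrative rather than taken from the paper.

```python
# Minimal sketch of the DPO loss: the policy is trained to prefer the chosen
# response over the rejected one, measured relative to a frozen reference
# model. Inputs are per-example sequence log-probabilities (tensors).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```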
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.