Human Feedback is not Gold Standard
- URL: http://arxiv.org/abs/2309.16349v2
- Date: Tue, 16 Jan 2024 16:51:33 GMT
- Title: Human Feedback is not Gold Standard
- Authors: Tom Hosking, Phil Blunsom, Max Bartolo
- Abstract summary: We critically analyse the use of human feedback for both training and evaluation.
We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality.
- Score: 28.63384327791185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human feedback has become the de facto standard for evaluating the
performance of Large Language Models, and is increasingly being used as a
training objective. However, it is not clear which properties of a generated
output this single `preference' score captures. We hypothesise that preference
scores are subjective and open to undesirable biases. We critically analyse the
use of human feedback for both training and evaluation, to verify whether it
fully captures a range of crucial error criteria. We find that while preference
scores have fairly good coverage, they under-represent important aspects like
factuality. We further hypothesise that both preference scores and error
annotation may be affected by confounders, and leverage instruction-tuned
models to generate outputs that vary along two possible confounding dimensions:
assertiveness and complexity. We find that the assertiveness of an output skews
the perceived rate of factuality errors, indicating that human annotations are
not a fully reliable evaluation metric or training objective. Finally, we offer
preliminary evidence that using human feedback as a training objective
disproportionately increases the assertiveness of model outputs. We encourage
future work to carefully consider whether preference scores are well aligned
with the desired objective.
Related papers
- Semi-supervised Learning For Robust Speech Evaluation [30.593420641501968]
Speech evaluation measures a learners oral proficiency using automatic models.
This paper proposes to address such challenges by exploiting semi-supervised pre-training and objective regularization.
An anchor model is trained using pseudo labels to predict the correctness of pronunciation.
arXiv Detail & Related papers (2024-09-23T02:11:24Z) - Fighting Sampling Bias: A Framework for Training and Evaluating Credit Scoring Models [2.918530881730374]
This paper addresses the adverse effect of sampling bias on model training and evaluation.
We propose bias-aware self-learning and a reject inference framework for scorecard evaluation.
Our results suggest a profit improvement of about eight percent, when using Bayesian evaluation to decide on acceptance rates.
arXiv Detail & Related papers (2024-07-17T20:59:54Z) - Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model [75.66013048128302]
In this work, we investigate the potential of employing the QE model as the reward model to predict human preferences for feedback training.
We first identify the overoptimization problem during QE-based feedback training, manifested as an increase in reward while translation quality declines.
To address the problem, we adopt a simple yet effective method that uses rules to detect the incorrect translations and assigns a penalty term to the reward scores of them.
arXiv Detail & Related papers (2024-01-23T16:07:43Z) - Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning
from Human Feedback [55.78118035358662]
Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values.
We have identified that the reward model often finds shortcuts to bypass its intended objectives.
We propose an innovative solution, applying the Product-of-Experts technique to separate reward modeling from the influence of sequence length.
arXiv Detail & Related papers (2023-10-08T15:14:39Z) - Peering Through Preferences: Unraveling Feedback Acquisition for
Aligning Large Language Models [32.843361525236965]
We analyze the effect of sparse feedback on the alignment and evaluation of large language models.
We find that preferences from ratings and rankings significantly disagree 60% for both human and AI annotators.
Our findings shed light on critical gaps in methods for evaluating the real-world utility of language models.
arXiv Detail & Related papers (2023-08-30T07:35:32Z) - Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks.
We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations.
We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z) - Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback.
ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements.
We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback.
arXiv Detail & Related papers (2023-03-28T17:04:15Z) - Models of human preference for learning reward functions [80.39289349661364]
We learn the reward function from human-generated preferences between pairs of trajectory segments.
We find this assumption to be flawed and propose modeling human preferences as informed by each segment's regret.
Our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned.
arXiv Detail & Related papers (2022-06-05T17:58:02Z) - Debiased Explainable Pairwise Ranking from Implicit Feedback [0.3867363075280543]
We focus on the state of the art pairwise ranking model, Bayesian Personalized Ranking (BPR)
BPR is a black box model that does not explain its outputs, thus limiting the user's trust in the recommendations.
We propose a novel explainable loss function and a corresponding Matrix Factorization-based model that generates recommendations along with item-based explanations.
arXiv Detail & Related papers (2021-07-30T17:19:37Z) - Dialogue Response Ranking Training with Large-Scale Human Feedback Data [52.12342165926226]
We leverage social media feedback data to build a large-scale training dataset for feedback prediction.
We trained DialogRPT, a set of GPT-2 based models on 133M pairs of human feedback data.
Our ranker outperforms the conventional dialog perplexity baseline with a large margin on predicting Reddit feedback.
arXiv Detail & Related papers (2020-09-15T10:50:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.