How Human is Human Evaluation? Improving the Gold Standard for NLG with
Utility Theory
- URL: http://arxiv.org/abs/2205.11930v1
- Date: Tue, 24 May 2022 09:51:27 GMT
- Title: How Human is Human Evaluation? Improving the Gold Standard for NLG with
Utility Theory
- Authors: Kawin Ethayarajh, Dan Jurafsky
- Abstract summary: We propose a new evaluation protocol called $\textit{system-level probabilistic assessment}$ (SPA).
In our experiments, we find that according to SPA, annotators prefer larger GPT-3 variants to smaller ones -- as expected -- with all comparisons being statistically significant.
- Score: 47.10283773005394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human ratings are treated as the gold standard in NLG evaluation. The
standard protocol is to collect ratings of generated text, average across
annotators, and then rank NLG systems by their average scores. However, little
consideration has been given as to whether this approach faithfully captures
human preferences. In this work, we analyze this standard protocol through the
lens of utility theory in economics. We first identify the implicit assumptions
it makes about annotators and find that these assumptions are often violated in
practice, in which case annotator ratings become an unfaithful reflection of
their preferences. The most egregious violations come from using Likert scales,
which provably reverse the direction of the true preference in certain cases.
We suggest improvements to the standard protocol to make it more theoretically
sound, but even in its improved form, it cannot be used to evaluate open-ended
tasks like story generation. For the latter, we propose a new evaluation
protocol called $\textit{system-level probabilistic assessment}$ (SPA). In our
experiments, we find that according to SPA, annotators prefer larger GPT-3
variants to smaller ones -- as expected -- with all comparisons being
statistically significant. In contrast, the standard protocol only yields
significant results half the time.
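To make the contrast concrete, here is a minimal sketch of both protocols under illustrative assumptions: the standard pipeline averages hypothetical item-level Likert ratings per system and ranks by the mean, while the SPA-style comparison treats annotator preferences over sampled output pairs as Bernoulli trials and applies a two-sided sign test at the system level. The data, sample sizes, and aggregation details are placeholders, not the paper's exact experimental setup.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

# --- Standard protocol: average item-level Likert ratings, then rank systems ---
# ratings[s] has shape (n_items, n_annotators); values are hypothetical 1-5 Likert scores.
ratings = {
    "gpt3-small": rng.integers(2, 5, size=(50, 3)),
    "gpt3-large": rng.integers(3, 6, size=(50, 3)),
}
mean_scores = {system: r.mean() for system, r in ratings.items()}
ranking = sorted(mean_scores, key=mean_scores.get, reverse=True)
print("Standard-protocol ranking:", ranking)

# --- SPA-style system-level probabilistic assessment (illustrative sketch) ---
# prefers_large[i] = 1 if annotator judgment i favoured the larger system's output
# when shown a sampled pair of outputs, 0 otherwise (placeholder judgments).
prefers_large = rng.integers(0, 2, size=60)
k, n = int(prefers_large.sum()), len(prefers_large)
test = binomtest(k, n, p=0.5, alternative="two-sided")  # sign test on preferences
print(f"P(prefer larger) = {k / n:.2f}, p = {test.pvalue:.3f}")
```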
Related papers
- Reevaluation of Inductive Link Prediction [9.955225436683959]
We show that the evaluation protocol currently used for inductive link prediction is heavily flawed.
Due to the limited size of the set of negatives, a simple rule-based baseline can achieve state-of-the-art results.
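The effect of a small negative set can be reproduced with a toy ranking experiment: when the true answer only has to outrank a handful of sampled negatives, even a barely-better-than-random scorer posts strong Hits@10, while ranking against the full candidate set tells a different story. The sketch below is a generic illustration of that negative-sampling pitfall, not the paper's evaluation code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_entities, n_sampled_negs = 1000, 10_000, 50

# Hypothetical scores from a weak model: the true entity gets only a slight edge.
true_scores = rng.normal(0.2, 1.0, size=n_queries)
neg_scores = rng.normal(0.0, 1.0, size=(n_queries, n_entities - 1))

def hits_at_10(true_s, neg_s):
    ranks = 1 + (neg_s > true_s[:, None]).sum(axis=1)  # rank of the true entity
    return (ranks <= 10).mean()

print("Hits@10 vs 50 sampled negatives:", hits_at_10(true_scores, neg_scores[:, :n_sampled_negs]))
print("Hits@10 vs all candidates:      ", hits_at_10(true_scores, neg_scores))
```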
arXiv Detail & Related papers (2024-09-30T09:32:10Z)
- Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003]
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data.
We propose a robust RLHF approach -- $R^3M$, which models the potentially corrupted preference label as sparse outliers.
Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R^3M$ improves robustness of the reward against several types of perturbations to the preference data.
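One plausible reading of "corrupted preference labels as sparse outliers" is a Bradley-Terry reward loss augmented with per-example slack variables kept sparse by an l1 penalty, so that a few flipped labels are absorbed by the slack instead of distorting the reward model. The sketch below illustrates that generic idea; it is not the paper's $R^3M$ objective, and the penalty weight is an arbitrary placeholder.

```python
import torch

def robust_preference_loss(r_chosen, r_rejected, slack, l1_weight=0.1):
    """Generic sketch: Bradley-Terry preference loss with per-example slack
    variables; the l1 penalty keeps the slack sparse, so only a handful of
    examples (the presumed corrupted labels) can be explained away."""
    margin = r_chosen - r_rejected + slack            # slack absorbs label noise
    bt_loss = torch.nn.functional.softplus(-margin)   # = -log sigmoid(margin)
    return bt_loss.mean() + l1_weight * slack.abs().mean()

# Toy usage: rewards from some reward model, slack variables learned jointly.
r_chosen = torch.randn(8, requires_grad=True)
r_rejected = torch.randn(8, requires_grad=True)
slack = torch.zeros(8, requires_grad=True)
robust_preference_loss(r_chosen, r_rejected, slack).backward()
```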
arXiv Detail & Related papers (2024-06-21T18:06:30Z)
- KTO: Model Alignment as Prospect Theoretic Optimization [67.44320255397506]
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner.
We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases.
We propose a human-aware loss (HALO) that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences.
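As a rough, non-authoritative sketch of what "maximizing the utility of generations" can look like, the code below scores each generation with a prospect-theory-style value function (a sigmoid of the policy-vs-reference log-ratio around a reference point, with separate weights for desirable and undesirable outputs) instead of the log-likelihood of pairwise preferences. The weights, reference point, and beta are illustrative assumptions rather than the exact KTO objective.

```python
import torch

def halo_style_loss(logratio, desirable, beta=0.1, z_ref=0.0,
                    w_desirable=1.0, w_undesirable=1.0):
    """Illustrative prospect-theoretic utility: a generation's value is a sigmoid
    of how far the policy/reference log-ratio sits from a reference point, with
    desirable and undesirable outputs weighted separately (loss aversion)."""
    gain = torch.sigmoid(beta * (logratio - z_ref))   # value of a desirable output
    pain = torch.sigmoid(beta * (z_ref - logratio))   # value of an undesirable output
    value = torch.where(desirable, w_desirable * gain, w_undesirable * pain)
    return (1.0 - value).mean()  # maximizing utility == minimizing 1 - value

# Toy usage: log pi_theta(y|x) - log pi_ref(y|x) for a batch of generations.
logratio = torch.randn(16, requires_grad=True)
desirable = torch.rand(16) > 0.5
halo_style_loss(logratio, desirable).backward()
```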
arXiv Detail & Related papers (2024-02-02T10:53:36Z)
- Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models [32.843361525236965]
We analyze the effect of sparse feedback on the alignment and evaluation of large language models.
We find that preferences inferred from ratings and rankings disagree significantly, about 60% of the time, for both human and AI annotators.
Our findings shed light on critical gaps in methods for evaluating the real-world utility of language models.
arXiv Detail & Related papers (2023-08-30T07:35:32Z)
- On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-$n$ Recommendation [12.036747050794135]
Normalised Discounted Cumulative Gain (nDCG) is one such metric that has seen widespread adoption in empirical studies.
We show that our unbiased DCG estimates strongly correlate with online reward, even when some of the metric's inherent assumptions are violated.
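For reference, (normalised) DCG discounts each item's relevance by the log of its rank, and the off-policy use studied here additionally reweights logged interactions by inverse propensities. The sketch below shows the plain metric plus a generic IPS-weighted DCG estimate; the click and propensity values are assumptions for illustration, not the paper's estimator.

```python
import numpy as np

def dcg(relevances):
    ranks = np.arange(1, len(relevances) + 1)
    return float(np.sum(relevances / np.log2(ranks + 1)))

def ndcg(relevances):
    ideal = dcg(np.sort(relevances)[::-1])
    return dcg(relevances) / ideal if ideal > 0 else 0.0

rels = np.array([3.0, 2.0, 0.0, 1.0])          # hypothetical graded relevance by rank
print("DCG =", round(dcg(rels), 3), "nDCG =", round(ndcg(rels), 3))

# Generic IPS-weighted DCG estimate from logged clicks (illustrative only):
clicks = np.array([1.0, 0.0, 1.0, 0.0])        # clicks observed under the logging policy
propensities = np.array([0.9, 0.6, 0.4, 0.2])  # assumed P(item examined at its rank)
print("IPS-corrected DCG estimate =", round(dcg(clicks / propensities), 3))
```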
arXiv Detail & Related papers (2023-07-27T17:57:42Z)
- Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
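A common mitigation in this spirit is to query the judge with both presentation orders and aggregate, so neither candidate benefits from its position in the context. The sketch below assumes a hypothetical judge(prompt, first, second) callable returning a score for each position; it illustrates balanced-order calibration generically rather than reproducing the paper's exact strategies.

```python
from typing import Callable, Tuple

def balanced_judgement(
    judge: Callable[[str, str, str], Tuple[float, float]],
    prompt: str,
    answer_a: str,
    answer_b: str,
) -> Tuple[float, float]:
    """Query the judge with both orderings and average the score each answer
    receives, cancelling any advantage tied to its position in the context."""
    a_first, b_second = judge(prompt, answer_a, answer_b)
    b_first, a_second = judge(prompt, answer_b, answer_a)
    return (a_first + a_second) / 2, (b_second + b_first) / 2

# Toy judge that (badly) favours whichever answer appears first.
biased_judge = lambda prompt, first, second: (8.0, 7.0)
print(balanced_judgement(biased_judge, "Which answer is better?", "A", "B"))  # (7.5, 7.5)
```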
arXiv Detail & Related papers (2023-05-29T07:41:03Z)
- Doubly-Robust Estimation for Unbiased Learning-to-Rank from Position-Biased Click Feedback [13.579420996461439]
We introduce a novel DR estimator that uses the expectation of treatment per rank instead of IPS estimation.
Our results indicate it requires several orders of magnitude fewer datapoints to converge at optimal performance.
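To situate the estimator family, the sketch below contrasts the standard IPS correction for position bias (clicks reweighted by examination propensities) with a doubly-robust variant that adds a regression model and an IPS-corrected residual, which stays unbiased even when the regression is off. This is the generic IPS/DR template under assumed propensities and a deliberately biased placeholder regression, not the paper's per-rank expectation estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
exam_prop = rng.uniform(0.1, 0.9, size=n)      # assumed P(user examined the item)
relevance = rng.uniform(0.0, 1.0, size=n)      # latent relevance (unknown in practice)
clicks = (rng.random(n) < exam_prop * relevance).astype(float)

# IPS: reweight observed clicks by the inverse examination propensity.
ips_estimate = np.mean(clicks / exam_prop)

# DR: placeholder regression prediction plus an IPS-corrected residual.
reg_pred = np.full(n, relevance.mean() * 0.8)  # deliberately biased regression model
dr_estimate = np.mean(reg_pred + (clicks - exam_prop * reg_pred) / exam_prop)

print(f"true mean relevance = {relevance.mean():.3f}, "
      f"IPS = {ips_estimate:.3f}, DR = {dr_estimate:.3f}")
```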
arXiv Detail & Related papers (2022-03-31T15:38:25Z)
- Control Variates for Slate Off-Policy Evaluation [112.35528337130118]
We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions.
We obtain new estimators with risk improvement guarantees over both the PI and self-normalized PI estimators.
arXiv Detail & Related papers (2021-06-15T06:59:53Z)
- Consistent Instance False Positive Improves Fairness in Face Recognition [46.55971583252501]
Existing methods heavily rely on accurate demographic annotations.
These methods are typically designed for a specific demographic group and are not general enough.
We propose a false positive rate penalty loss, which mitigates face recognition bias by increasing the consistency of instance False Positive Rate.
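A hedged sketch of what an instance-level FPR consistency penalty can look like: each probe's soft false-positive rate over its non-matching pairs is pushed toward the batch-level rate, so no individual (or demographic group) absorbs a disproportionate share of false accepts. The sigmoid relaxation, threshold, and temperature below are illustrative assumptions, not the paper's exact loss.

```python
import torch

def instance_fpr_penalty(nonmatch_sims, threshold=0.35, temp=0.05):
    """Soft per-probe false-positive rate via a sigmoid relaxation of the
    decision threshold, penalised toward the batch-level FPR for consistency."""
    # nonmatch_sims: (N, M) cosine similarities of N probes to M non-matching faces
    soft_fp = torch.sigmoid((nonmatch_sims - threshold) / temp)  # ~1 above threshold
    instance_fpr = soft_fp.mean(dim=1)        # per-probe false-positive rate
    batch_fpr = instance_fpr.mean()           # overall false-positive rate
    return ((instance_fpr - batch_fpr) ** 2).mean()

# Toy usage: added as a regulariser alongside the usual recognition loss.
sims = torch.rand(32, 128, requires_grad=True)
instance_fpr_penalty(sims).backward()
```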
arXiv Detail & Related papers (2021-06-10T06:20:37Z)
- Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details [107.2722027807328]
We find that the default implementation of AP is neither category independent, nor does it directly reward properly calibrated detectors.
We show that the default implementation produces a gameable metric, where a simple, nonsensical re-ranking policy can improve AP by a large margin.
We benchmark recent advances in large-vocabulary detection and find that many reported gains do not translate to improvements under our new per-class independent evaluation.
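For contrast with the default pooled implementation, the sketch below computes AP independently per class from that class's own detections, so one category's confidence scale cannot be traded off against another's. It is a plain AP computation for illustration, not the benchmark's official evaluation code.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for a single class: precision averaged over this class's true-positive
    detections only (labels: 1 = true positive, 0 = false positive)."""
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(1 - labels)
    precision = tp / (tp + fp)
    return float(np.sum(precision * labels) / max(labels.sum(), 1))

# Per-class independent evaluation: scores are never ranked across categories.
per_class_ap = {
    "cat": average_precision(np.array([0.9, 0.8, 0.3]), np.array([1, 0, 1])),
    "rare_widget": average_precision(np.array([0.2, 0.1]), np.array([1, 1])),
}
print({cls: round(ap, 3) for cls, ap in per_class_ap.items()})
```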
arXiv Detail & Related papers (2021-02-01T18:56:02Z)