Related papers: Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge

Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge

URL: http://arxiv.org/abs/2410.03775v1
Date: Thu, 3 Oct 2024 03:08:29 GMT
Title: Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge
Authors: Aparna Elangovan, Jongwoo Ko, Lei Xu, Mahsa Elyasi, Ling Liu, Sravan Bodapati, Dan Roth,
Abstract summary: We show how *relying on a single aggregate correlation score* can obscure fundamental differences between human behavior and automatic evaluation methods. We propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance.
Score: 51.93909886542317
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The effectiveness of automatic evaluation of generative models is typically measured by comparing it to human evaluation using correlation metrics. However, metrics like Krippendorff's $\alpha$ and Randolph's $\kappa$, originally designed to measure the reliability of human labeling, make assumptions about human behavior and the labeling process. In this paper, we show how *relying on a single aggregate correlation score* can obscure fundamental differences between human behavior and automatic evaluation methods, including LLM-as-a-Judge. Specifically, we demonstrate that when the proportion of samples with variation or uncertainty in human labels (gathered during human evaluation) is relatively high, machine labels (generated by automatic evaluation methods) may superficially appear to have similar or better correlation with the human majority label compared to human-to-human (HH) correlation. This can create the misleading impression that automatic evaluation is accurate enough to approximate the human majority label. However, as the proportion of samples with consistent human labels increases, the correlation between machine labels and human majority labels declines, falling below HH correlation. Based on these findings, we first propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance. Second, recognizing that uncertainty and variation are inherent in perception-based human evaluations, such as those involving attitudes or preferences, we introduce a new metric - *binned Jensen-Shannon Divergence for perception* for such scenarios to better measure the effectiveness of automatic evaluations. Third, we present visualization techniques -- *perception charts*, to compare the strengths and limitations of automatic evaluation and to contextualize correlation measures appropriately

Related papers

"All that Glitters": Approaches to Evaluations with Unreliable Model and Human Annotations [0.0]
"Gold" and "ground truth" human-mediated labels have error. This study demonstrates methods for answering such questions even in the context of very low reliabilities from expert humans.
arXiv Detail & Related papers (2024-11-23T19:18:08Z)
Learning with Confidence: Training Better Classifiers from Soft Labels [0.0]
In supervised machine learning, models are typically trained using data with hard labels, i.e., definite assignments of class membership. We investigate whether incorporating label uncertainty, represented as discrete probability distributions over the class labels, improves the predictive performance of classification models.
arXiv Detail & Related papers (2024-09-24T13:12:29Z)
Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Models evaluator preferences with human evaluations. We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer. Our findings significantly improve aligning the recalibrated LLM evaluator with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z)
It HAS to be Subjective: Human Annotator Simulation via Zero-shot Density Estimation [15.8765167340819]
Human annotator simulation (HAS) serves as a cost-effective substitute for human evaluation such as data annotation and system assessment. Human perception and behaviour during human evaluation exhibit inherent variability due to diverse cognitive processes and subjective interpretations. This paper introduces a novel meta-learning framework that treats HAS as a zero-shot density estimation problem.
arXiv Detail & Related papers (2023-09-30T20:54:59Z)
What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs) We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence. We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks. We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations. We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z)
Improving Classifier Robustness through Active Generation of Pairwise Counterfactuals [22.916599410472102]
We present a novel framework that utilizes counterfactual generative models to generate a large number of diverse counterfactuals. We show that with a small amount of human-annotated counterfactual data (10%), we can generate a counterfactual augmentation dataset with learned labels.
arXiv Detail & Related papers (2023-05-22T23:19:01Z)
The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation [21.513743126525622]
We argue that this big open problem of human label variation persists and critically needs more attention to move our field forward. We reconcile different previously proposed notions of human label variation, provide a repository of publicly-available datasets with un-aggregated labels, depict approaches proposed so far, identify gaps and suggest ways forward.
arXiv Detail & Related papers (2022-11-04T16:38:09Z)
Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics. We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
Multi-label Classification with High-rank and High-order Label Correlations [62.39748565407201]
Previous methods capture the high-order label correlations mainly by transforming the label matrix to a latent label space with low-rank matrix factorization. We propose a simple yet effective method to depict the high-order label correlations explicitly, and at the same time maintain the high-rank of the label matrix. Comparative studies over twelve benchmark data sets validate the effectiveness of the proposed algorithm in multi-label classification.
arXiv Detail & Related papers (2022-07-09T05:15:31Z)
Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings. We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z)
An Empirical Investigation of Learning from Biased Toxicity Labels [15.822714574671412]
We study how different training strategies can leverage a small dataset of human-annotated labels and a large but noisy dataset of synthetically generated labels. We evaluate the accuracy and fairness properties of these approaches, and trade-offs between the two.
arXiv Detail & Related papers (2021-10-04T17:19:57Z)
A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are. Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
Exploiting Sample Uncertainty for Domain Adaptive Person Re-Identification [137.9939571408506]
We estimate and exploit the credibility of the assigned pseudo-label of each sample to alleviate the influence of noisy labels. Our uncertainty-guided optimization brings significant improvement and achieves the state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2020-12-16T04:09:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.