Related papers: The statistical advantage of automatic NLG metrics at the system level

URL: http://arxiv.org/abs/2105.12437v2
Date: Fri, 13 Dec 2024 19:16:14 GMT
Title: The statistical advantage of automatic NLG metrics at the system level
Authors: Johnny Tian-Zheng Wei, Robin Jia
Abstract summary: Statistically, humans are unbiased, high variance estimators, while metrics are biased, low variance estimators. We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap. Our analysis compares the adjusted error of metrics to humans and a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected.
Score: 23.12467573182206
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Estimating the expected output quality of generation systems is central to NLG. This paper qualifies the notion that automatic metrics are not as good as humans in estimating system-level quality. Statistically, humans are unbiased, high variance estimators, while metrics are biased, low variance estimators. We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap. Measuring this error is complicated: predictions are evaluated against noisy, human predicted labels instead of the ground truth, and metric predictions fluctuate based on the test sets they were calculated on. By applying a bias-variance-noise decomposition, we adjust this error to a noise-free, infinite test set setting. Our analysis compares the adjusted error of metrics to humans and a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected. In MT, we identify two settings where metrics outperform humans due to a statistical advantage in variance: when the number of human judgments used is small, and when the quality difference between compared systems is small. The data and code to reproduce our analyses are available at https://github.com/johntzwei/metric-statistical-advantage .
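The pairwise-prediction comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code (their implementation is in the repository linked above); it is a simplified illustration under stated assumptions: the function name `bootstrap_pairwise_error` is hypothetical, the scores are synthetic, the two systems are resampled independently, and the noise-free adjustment via the bias-variance-noise decomposition is not reproduced.

```python
import random


def bootstrap_pairwise_error(scores_a, scores_b, true_sign, n_judgments,
                             n_boot=2000, seed=0):
    """Estimate how often a mean-score comparison picks the wrong system.

    scores_a, scores_b: per-segment scores for systems A and B.
    true_sign: +1 if A is assumed truly better, -1 if B is.
    n_judgments: number of scores drawn per system in each resample.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_boot):
        # Resample with replacement (the bootstrap step).
        sample_a = rng.choices(scores_a, k=n_judgments)
        sample_b = rng.choices(scores_b, k=n_judgments)
        diff = sum(sample_a) / n_judgments - sum(sample_b) / n_judgments
        predicted = 1 if diff > 0 else -1
        if predicted != true_sign:
            errors += 1
    return errors / n_boot


# Synthetic data (an assumption, not taken from the paper):
# "human" scores are centred on the true quality but noisy;
# "metric" scores are slightly off-centre (biased) but much less noisy.
data_rng = random.Random(1)
human_a = [data_rng.gauss(0.55, 1.0) for _ in range(500)]
human_b = [data_rng.gauss(0.50, 1.0) for _ in range(500)]
metric_a = [data_rng.gauss(0.52, 0.1) for _ in range(500)]
metric_b = [data_rng.gauss(0.50, 0.1) for _ in range(500)]

for n in (10, 100):
    human_err = bootstrap_pairwise_error(human_a, human_b, 1, n)
    metric_err = bootstrap_pairwise_error(metric_a, metric_b, 1, n)
    print(f"n={n}: human error={human_err:.3f}, metric error={metric_err:.3f}")
```

Varying `n_judgments` and the gap between the two systems' means is one way to explore the two settings the abstract mentions (few human judgments, small quality differences), though the numbers produced here say nothing about the paper's actual results.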

Related papers

What should an AI assessor optimise for? [57.96463917842822]
An AI assessor is an external, ideally independent system that predicts an indicator, e.g., a loss value, of another AI system. Here we address the question: is it always optimal to train the assessor for the target metric? We experimentally explore this question for, respectively, regression losses and classification scores with monotonic and non-monotonic mappings.
arXiv Detail & Related papers (2025-02-01T08:41:57Z)
Prediction-Powered Inference with Imputed Covariates and Nonuniform Sampling [20.078602767179355]
Failure to properly account for errors in machine learning predictions renders standard statistical procedures invalid. We introduce bootstrap confidence intervals that apply when the complete data is a nonuniform (i.e., weighted, stratified, or clustered) sample and to settings where an arbitrary subset of features is imputed. We prove that these confidence intervals are valid under no assumptions on the quality of the machine learning model and are no wider than the intervals obtained by methods that do not use machine learning predictions.
arXiv Detail & Related papers (2025-01-30T18:46:43Z)
Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator [6.532478490187084]
MESA employs a three-step assessment of individual error types, multi-agent discussion for decision refinement, and feedback-based self-training to refine error definition understanding and alignment with human judgment. Using GPT-4o as its backbone, MESA achieves correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods.
arXiv Detail & Related papers (2024-11-27T15:35:32Z)
What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
Leveraging Variational Autoencoders for Parameterized MMSE Estimation [10.141454378473972]
We propose a variational autoencoder-based framework for parameterizing a conditional linear minimum mean squared error estimator. The derived estimator is shown to approximate the minimum mean squared error estimator by utilizing the variational autoencoder as a generative prior for the estimation problem. We conduct a rigorous analysis by bounding the difference between the proposed and the minimum mean squared error estimator.
arXiv Detail & Related papers (2023-07-11T15:41:34Z)
On Fairness and Stability: Is Estimator Variance a Friend or a Foe? [6.751310968561177]
We propose a new family of performance measures based on group-wise parity in variance. We develop and release an open-source library that reconciles uncertainty quantification techniques with fairness analysis.
arXiv Detail & Related papers (2023-02-09T09:35:36Z)
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation. We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings. SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
Analysis and Comparison of Classification Metrics [12.092755413404245]
Metrics for measuring the quality of system scores include the area under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC or Bayes risk. We show how to use these metrics to compute a system's calibration loss and compare this metric with the widely-used expected calibration error (ECE).
arXiv Detail & Related papers (2022-09-12T16:06:10Z)
D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies human-in-the-loop AI approach for auditing and mitigating social biases. A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network. For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance [5.650647159993238]
Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular. We show that the statistical problems with covariance estimation drive the poor performance of H-score. We propose a correction and recommend measuring correlation performance against relative accuracy in such settings.
arXiv Detail & Related papers (2021-10-13T17:24:12Z)
Expected Validation Performance and Estimation of a Random Variable's Maximum [48.83713377993604]
We analyze three statistical estimators for expected validation performance. We find the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias. We find that the two biased estimators lead to the fewest incorrect conclusions.
arXiv Detail & Related papers (2021-10-01T18:48:47Z)
SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets. Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z)
Machine learning for causal inference: on the use of cross-fit estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties. We conducted a simulation study to assess the performance of several estimators for the average causal effect (ACE). When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage.
arXiv Detail & Related papers (2020-04-21T23:09:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.