The statistical advantage of automatic NLG metrics at the system level
- URL: http://arxiv.org/abs/2105.12437v1
- Date: Wed, 26 May 2021 09:53:57 GMT
- Title: The statistical advantage of automatic NLG metrics at the system level
- Authors: Johnny Tian-Zheng Wei and Robin Jia
- Abstract summary: Statistically, humans are unbiased, high variance estimators, while metrics are biased, low variance estimators.
We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap.
Our analysis compares the adjusted error of metrics to humans and a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected.
- Score: 10.540821585237222
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Estimating the expected output quality of generation systems is central to
NLG. This paper qualifies the notion that automatic metrics are not as good as
humans in estimating system-level quality. Statistically, humans are unbiased,
high variance estimators, while metrics are biased, low variance estimators. We
compare these estimators by their error in pairwise prediction (which
generation system is better?) using the bootstrap. Measuring this error is
complicated: predictions are evaluated against noisy, human predicted labels
instead of the ground truth, and metric predictions fluctuate based on the test
sets they were calculated on. By applying a bias-variance-noise decomposition,
we adjust this error to a noise-free, infinite test set setting. Our analysis
compares the adjusted error of metrics to humans and a derived, perfect
segment-level annotator, both of which are unbiased estimators dependent on the
number of judgments collected. In MT, we identify two settings where metrics
outperform humans due to a statistical advantage in variance: when the number
of human judgments used is small, and when the quality difference between
compared systems is small. The data and code to reproduce our analyses are
available at https://github.com/johntzwei/metric-statistical-advantage .
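For intuition, here is a minimal sketch of the bootstrap comparison the abstract describes, run on hypothetical segment-level scores. It is not the authors' released code (see the linked repository for that), and the helper name is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_pairwise_error(metric_a, metric_b, human_a, human_b, n_boot=1000):
    """Estimate how often a metric's pairwise prediction ("which system is
    better?") disagrees with the noisy human-judged label, with the metric
    recomputed on bootstrap resamples of the test segments."""
    metric_a, metric_b = np.asarray(metric_a), np.asarray(metric_b)
    n = len(metric_a)
    # Noisy "gold" label: which system do the collected human judgments prefer?
    gold = np.mean(human_a) > np.mean(human_b)
    errors = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test segments with replacement
        pred = metric_a[idx].mean() > metric_b[idx].mean()
        errors += pred != gold
    return errors / n_boot
```

The paper goes further: it applies a bias-variance-noise decomposition to adjust this raw error to a noise-free, infinite-test-set setting. The sketch above reproduces only the raw bootstrap step.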
Related papers
- What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
- Leveraging Variational Autoencoders for Parameterized MMSE Estimation [10.141454378473972]
We propose a variational autoencoder-based framework for parameterizing a conditional linear minimum mean squared error estimator.
The derived estimator is shown to approximate the minimum mean squared error estimator by utilizing the variational autoencoder as a generative prior for the estimation problem.
We conduct a rigorous analysis by bounding the difference between the proposed and the minimum mean squared error estimator.
arXiv Detail & Related papers (2023-07-11T15:41:34Z)
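For context, the classical linear MMSE estimator that such a framework parameterizes fits in a few lines. This is a generic textbook sketch, not the paper's VAE-based method (in their approach the generative model supplies the conditional statistics):

```python
import numpy as np

def lmmse_estimate(y, mu_x, mu_y, C_xy, C_yy):
    """Linear MMSE estimate of x from observation y:
    x_hat = mu_x + C_xy @ C_yy^{-1} @ (y - mu_y)."""
    return mu_x + C_xy @ np.linalg.solve(C_yy, y - mu_y)
```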
- On Fairness and Stability: Is Estimator Variance a Friend or a Foe? [6.751310968561177]
We propose a new family of performance measures based on group-wise parity in variance.
We develop and release an open-source library that reconciles uncertainty quantification techniques with fairness analysis.
arXiv Detail & Related papers (2023-02-09T09:35:36Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- Analysis and Comparison of Classification Metrics [12.092755413404245]
Metrics for measuring the quality of system scores include the area under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC or Bayes risk.
We show how to use these metrics to compute a system's calibration loss and compare this metric with the widely-used expected calibration error (ECE).
arXiv Detail & Related papers (2022-09-12T16:06:10Z)
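Two of the quantities named in that entry are easy to state concretely. Below is a minimal sketch of the Brier score and a binned expected calibration error for binary classifiers; this is a common ECE variant, not necessarily the exact formulation used in the paper:

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return np.mean((p - y) ** 2)

def ece(p, y, n_bins=10):
    """Expected calibration error: weighted average gap between predicted
    probability and empirical accuracy within equal-width probability bins."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return total
```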
- D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
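The standard definition that entry re-examines is simple to compute: average the metric scores and human scores per system, then correlate the per-system means across systems. A minimal sketch, with a hypothetical helper name:

```python
import numpy as np
from scipy.stats import pearsonr

def system_level_correlation(metric_scores, human_scores):
    """metric_scores, human_scores: dicts mapping system name to a list of
    per-segment scores. Returns the Pearson correlation of per-system means."""
    systems = sorted(metric_scores)
    m = [np.mean(metric_scores[s]) for s in systems]
    h = [np.mean(human_scores[s]) for s in systems]
    return pearsonr(m, h)[0]
```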
- Newer is not always better: Rethinking transferability metrics, their peculiarities, stability and performance [5.650647159993238]
Fine-tuning of large pre-trained image and language models on small customized datasets has become increasingly popular.
We show that the statistical problems with covariance estimation drive the poor performance of H-score.
We propose a correction and recommend measuring correlation performance against relative accuracy in such settings.
arXiv Detail & Related papers (2021-10-13T17:24:12Z)
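H-score measures transferability as tr(cov(f)^{-1} cov(E[f|y])) over penultimate-layer features f, and the diagnosis above is that naive covariance estimation breaks down in high dimensions. Here is a sketch of the metric with scikit-learn's Ledoit-Wolf shrinkage estimator swapped in as one concrete corrected covariance estimate; this is our illustration of the idea, not necessarily the exact correction the paper proposes:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def h_score(features, labels, shrinkage=True):
    """Transferability H-score: tr(cov(f)^{-1} @ cov(E[f | y])).
    features: (n, d) array of extracted features; labels: (n,) class ids."""
    cov_f = (LedoitWolf().fit(features).covariance_ if shrinkage
             else np.cov(features, rowvar=False))
    # Replace each feature vector by its class-conditional mean E[f | y].
    class_means = np.zeros_like(features)
    for c in np.unique(labels):
        class_means[labels == c] = features[labels == c].mean(axis=0)
    cov_g = np.cov(class_means, rowvar=False)
    return np.trace(np.linalg.pinv(cov_f) @ cov_g)
```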
- Expected Validation Performance and Estimation of a Random Variable's Maximum [48.83713377993604]
We analyze three statistical estimators for expected validation performance.
We find the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias.
We find that the two biased estimators lead to the fewest incorrect conclusions.
arXiv Detail & Related papers (2021-10-01T18:48:47Z)
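The quantity being estimated there is the expected maximum of n validation scores drawn from a budget of N hyperparameter runs. The standard unbiased construction draws without replacement and weights each order statistic by how often it is the maximum of a size-n subset; a sketch under that reading of the setup:

```python
from math import comb
import numpy as np

def expected_max_unbiased(scores, n):
    """Unbiased estimate of E[max of n draws without replacement] from N
    observed scores: the (i+1)-th smallest score is the maximum of exactly
    C(i, n-1) of the C(N, n) size-n subsets."""
    v = np.sort(np.asarray(scores, float))
    N = len(v)
    assert 1 <= n <= N
    return sum(v[i] * comb(i, n - 1) for i in range(N)) / comb(N, n)
```

As sanity checks, n = 1 recovers the sample mean and n = N recovers the sample maximum.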
- SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z)
- Machine learning for causal inference: on the use of cross-fit estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties.
We conducted a simulation study to assess the performance of several estimators for the average causal effect (ACE).
When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage.
arXiv Detail & Related papers (2020-04-21T23:09:55Z)
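For reference, here is a compact sketch of a doubly-robust (AIPW) estimator of the ACE with cross-fitting, using scikit-learn learners as the nuisance models. This is illustrative, not the simulation code from the paper:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold

def crossfit_aipw(X, A, Y, n_splits=2, seed=0):
    """Doubly-robust (AIPW) estimate of the ACE E[Y(1) - Y(0)], with the
    propensity and outcome models fit on held-out folds (cross-fitting)."""
    X, A, Y = np.asarray(X), np.asarray(A), np.asarray(Y)
    psi = np.zeros(len(Y))
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        # Propensity e(x) = P(A=1 | x) and outcome models m_a(x) = E[Y | A=a, x].
        e = GradientBoostingClassifier().fit(X[train], A[train])
        m1 = GradientBoostingRegressor().fit(X[train][A[train] == 1], Y[train][A[train] == 1])
        m0 = GradientBoostingRegressor().fit(X[train][A[train] == 0], Y[train][A[train] == 0])
        p = np.clip(e.predict_proba(X[test])[:, 1], 0.01, 0.99)  # trim extreme propensities
        mu1, mu0 = m1.predict(X[test]), m0.predict(X[test])
        psi[test] = (mu1 - mu0
                     + A[test] * (Y[test] - mu1) / p
                     - (1 - A[test]) * (Y[test] - mu0) / (1 - p))
    return psi.mean()
```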