A Statistical Analysis of Summarization Evaluation Metrics using
Resampling Methods
- URL: http://arxiv.org/abs/2104.00054v1
- Date: Wed, 31 Mar 2021 18:28:14 GMT
- Title: A Statistical Analysis of Summarization Evaluation Metrics using
Resampling Methods
- Authors: Daniel Deutsch, Rotem Dror, Dan Roth
- Abstract summary: We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
- Score: 60.04142561088524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quality of a summarization evaluation metric is quantified by calculating
the correlation between its scores and human annotations across a large number
of summaries. Currently, it is not clear how precise these correlation
estimates are, nor whether differences between two metrics' correlations
reflects a true difference or if it is due to random chance. In this work, we
address these two problems by proposing methods for calculating confidence
intervals and running hypothesis tests for correlations using two resampling
methods, bootstrapping and permutation. After evaluating which of the proposed
methods is most appropriate for summarization through two simulation
experiments, we analyze the results of applying these methods to several
different automatic evaluation metrics across three sets of human annotations.
We find that the confidence intervals are rather wide, demonstrating high
uncertainty in how reliable automatic metrics truly are. Further, although many
metrics fail to show statistical improvements over ROUGE, two recent works,
QAEval and BERTScore, do in some evaluation settings.
Related papers
- Semiparametric conformal prediction [79.6147286161434]
Risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables.
We treat the scores as random vectors and aim to construct the prediction set accounting for their joint correlation structure.
We report desired coverage and competitive efficiency on a range of real-world regression problems.
arXiv Detail & Related papers (2024-11-04T14:29:02Z) - Quantifying Variance in Evaluation Benchmarks [34.12254884944099]
We measure variance in evaluation benchmarks, including seed variance across initialisations, and monotonicity during training.
We find that simple changes, such as framing choice tasks as completion tasks, can often reduce variance for smaller scale.
More involved methods inspired from human testing literature (such as item analysis and item response theory) struggle to meaningfully reduce variance.
arXiv Detail & Related papers (2024-06-14T17:59:54Z) - Mutual Wasserstein Discrepancy Minimization for Sequential
Recommendation [82.0801585843835]
We propose a novel self-supervised learning framework based on Mutual WasserStein discrepancy minimization MStein for the sequential recommendation.
We also propose a novel contrastive learning loss based on Wasserstein Discrepancy Measurement.
arXiv Detail & Related papers (2023-01-28T13:38:48Z) - How to Find Strong Summary Coherence Measures? A Toolbox and a
Comparative Study for Summary Coherence Measure Evaluation [3.434197496862117]
We conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.
arXiv Detail & Related papers (2022-09-14T09:42:19Z) - Re-Examining System-Level Correlations of Automatic Summarization
Evaluation Metrics [64.81682222169113]
How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z) - REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation
Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z) - How to Evaluate a Summarizer: Study Design and Statistical Analysis for
Manual Linguistic Quality Evaluation [3.624563211765782]
We show that best choice of evaluation method can vary from one aspect to another.
We show that the total number of annotators can have a strong impact on study power.
Current statistical analysis methods can inflate type I error rates up to eight-fold.
arXiv Detail & Related papers (2021-01-27T10:14:15Z) - Uncertainty Quantification in Extreme Learning Machine: Analytical
Developments, Variance Estimates and Confidence Intervals [0.0]
Uncertainty quantification is crucial to assess prediction quality of a machine learning model.
Most methods proposed in the literature make strong assumptions on the data, ignore the randomness of input weights or neglect the bias contribution in confidence interval estimations.
This paper presents novel estimations that overcome these constraints and improve the understanding of ELM variability.
arXiv Detail & Related papers (2020-11-03T13:45:59Z) - Machine learning for causal inference: on the use of cross-fit
estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties.
We conducted a simulation study to assess the performance of several estimators for the average causal effect (ACE)
When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage.
arXiv Detail & Related papers (2020-04-21T23:09:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.