Better than Average: Paired Evaluation of NLP Systems
- URL: http://arxiv.org/abs/2110.10746v1
- Date: Wed, 20 Oct 2021 19:40:31 GMT
- Title: Better than Average: Paired Evaluation of NLP Systems
- Authors: Maxime Peyrard, Wei Zhao, Steffen Eger, Robert West
- Abstract summary: We show the importance of taking the instance-level pairing of evaluation scores into account.
We release a practical tool for performing the full analysis of evaluation scores with the mean, median, BT, and two variants of BT (Elo and TrueSkill).
- Score: 31.311553903738798
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluation in NLP is usually done by comparing the scores of competing
systems independently averaged over a common set of test instances. In this
work, we question the use of averages for aggregating evaluation scores into a
final number used to decide which system is best, since the average, as well as
alternatives such as the median, ignores the pairing arising from the fact that
systems are evaluated on the same test instances. We illustrate the importance
of taking the instance-level pairing of evaluation scores into account and
demonstrate, both theoretically and empirically, the advantages of aggregation
methods based on pairwise comparisons, such as the Bradley-Terry (BT) model, a
mechanism based on the estimated probability that a given system scores better
than another on the test set. By re-evaluating 296 real NLP evaluation setups
across four tasks and 18 evaluation metrics, we show that the choice of
aggregation mechanism matters and yields different conclusions as to which
systems are state of the art in about 30% of the setups. To facilitate the
adoption of pairwise evaluation, we release a practical tool for performing the
full analysis of evaluation scores with the mean, median, BT, and two variants
of BT (Elo and TrueSkill), alongside functionality for appropriate statistical
testing.
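To make the pairwise aggregation concrete, the following is a minimal sketch, not the authors' released tool, of how Bradley-Terry strengths could be fit from an instance-level score matrix using standard MM updates; the `bradley_terry` function, the tie-handling (ties counted as half-wins), and the toy scores are illustrative assumptions.

```python
import numpy as np

def bradley_terry(scores, n_iters=200, tol=1e-8):
    """Fit Bradley-Terry strengths from an (n_systems x n_instances) score matrix.

    Instance-level pairing: for every pair of systems (i, j), count on how many
    shared test instances system i scores strictly higher than system j
    (ties are split half-half, an assumption for this sketch). Strengths are
    estimated with standard MM updates and normalized to sum to 1.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]

    # Pairwise win counts derived from the paired, instance-level scores.
    wins = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            wins[i, j] = np.sum(scores[i] > scores[j]) + 0.5 * np.sum(scores[i] == scores[j])

    comparisons = wins + wins.T          # total paired comparisons per system pair
    strengths = np.ones(n) / n           # uniform initialization

    for _ in range(n_iters):
        new = np.zeros(n)
        for i in range(n):
            denom = 0.0
            for j in range(n):
                if i == j or comparisons[i, j] == 0:
                    continue
                denom += comparisons[i, j] / (strengths[i] + strengths[j])
            new[i] = wins[i].sum() / denom if denom > 0 else strengths[i]
        new /= new.sum()
        if np.max(np.abs(new - strengths)) < tol:
            strengths = new
            break
        strengths = new

    return strengths

# Toy usage: 3 systems evaluated on the same 5 test instances.
scores = [[0.70, 0.55, 0.80, 0.60, 0.75],   # system A
          [0.65, 0.60, 0.78, 0.62, 0.70],   # system B
          [0.50, 0.40, 0.60, 0.45, 0.55]]   # system C
print(bradley_terry(scores))  # higher strength = ranked better
```

A ranking read off these strengths can differ from a ranking by mean or median score whenever a system wins narrowly on most instances but loses badly on a few, which is exactly the instance-level pairing effect the paper highlights.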
Related papers
- Active Evaluation Acquisition for Efficient LLM Benchmarking [18.85604491151409]
We investigate strategies to improve evaluation efficiency by selecting a subset of examples from each benchmark using a learned policy.
Our approach models the dependencies across test examples, allowing accurate prediction of the evaluation outcomes for the remaining examples.
Empirical results demonstrate that our approach significantly reduces the number of evaluation prompts required.
arXiv Detail & Related papers (2024-10-08T12:08:46Z)
- Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling [50.08315607506652]
We propose a Constrained Active Sampling Framework (CASF) for reliable human judgment.
Experimental results show that CASF achieves 93.18% top-ranked system recognition accuracy.
arXiv Detail & Related papers (2024-06-12T07:44:36Z)
- A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
arXiv Detail & Related papers (2024-01-26T14:21:45Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns about whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.