Quantified Reproducibility Assessment of NLP Results
- URL: http://arxiv.org/abs/2204.05961v1
- Date: Tue, 12 Apr 2022 17:22:46 GMT
- Title: Quantified Reproducibility Assessment of NLP Results
- Authors: Anya Belz, Maja Popović and Simon Mille
- Abstract summary: This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) based on concepts and definitions from metrology.
We test QRA on 18 system and evaluation measure combinations, for each of which we have the original results and one to seven reproduction results.
The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but of different original studies.
- Score: 5.181381829976355
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper describes and tests a method for carrying out quantified
reproducibility assessment (QRA) that is based on concepts and definitions from
metrology. QRA produces a single score estimating the degree of reproducibility
of a given system and evaluation measure, on the basis of the scores from, and
differences between, different reproductions. We test QRA on 18 system and
evaluation measure combinations (involving diverse NLP tasks and types of
evaluation), for each of which we have the original results and one to seven
reproduction results. The proposed QRA method produces
degree-of-reproducibility scores that are comparable across multiple
reproductions not only of the same, but of different original studies. We find
that the proposed method facilitates insights into causes of variation between
reproductions, and allows conclusions to be drawn about what changes to system
and/or evaluation design might lead to improved reproducibility.
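The abstract does not spell out the scoring formula, but a natural metrology-style way to summarise spread across an original result and its reproductions is a coefficient-of-variation-type measure. The sketch below is an illustration under that assumption only, not the paper's exact QRA procedure; the function name, the small-sample correction factor and the example scores are all made up for illustration.

```python
from statistics import mean, stdev

def degree_of_reproducibility(scores):
    """Coefficient-of-variation-style spread over an original score and its
    reproduction scores, expressed as a percentage (lower = more reproducible).

    Illustrative sketch only: the exact QRA formula, including the
    small-sample correction used here, is an assumption rather than the
    paper's verbatim definition.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need the original score plus at least one reproduction")
    m = mean(scores)
    if m == 0:
        raise ValueError("a mean score of 0 makes a percentage-based measure undefined")
    # Sample standard deviation relative to the mean, with a small-sample
    # correction factor (an assumed choice), scaled to a percentage.
    return (1 + 1 / (4 * n)) * (stdev(scores) / abs(m)) * 100

# Example: one original BLEU score plus three reproduction scores (made-up numbers).
print(degree_of_reproducibility([27.3, 26.9, 27.8, 25.4]))
```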
Related papers
- ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations [16.591822946975547]
This paper reproduces the human evaluation findings of a prior NLP study on generating fact-checking explanations.
The results lend support to the original findings, with similar patterns seen between the original work and our reproduction.
arXiv Detail & Related papers (2024-04-26T15:31:25Z)
- With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector [4.636982694364995]
This work presents our efforts to reproduce the results of the human evaluation experiment presented in the paper of Vamvas and Sennrich (2022), which evaluated an automatic system detecting over- and undertranslations.
Despite the high quality of the documentation and code provided by the authors, we discuss some problems we found in reproducing the exact experimental setup and offer recommendations for improvement.
arXiv Detail & Related papers (2023-08-12T11:00:59Z)
- A Covariate-Adjusted Homogeneity Test with Application to Facial Recognition Accuracy Assessment [0.3222802562733786]
Ordinal scores occur commonly in medical imaging studies and in black-box forensic studies.
Our proposed test is applied to a face recognition study to identify statistically significant differences among five participant groups.
arXiv Detail & Related papers (2023-07-17T21:16:26Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on the Word Embeddings Association Test (WEAT) and the Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics (see the sketch after this entry).
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z)
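As referenced in the Social Biases entry above, WEAT is a standard bias statistic from the embedding-bias literature (Caliskan et al., 2017). The sketch below computes its usual effect size from word vectors; it illustrates only that building block, not the paper's full metric-evaluation pipeline, and the toy vectors in the usage example are random placeholders.

```python
import numpy as np

def _cos(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def _assoc(w, A, B):
    # s(w, A, B): mean cosine similarity of w to attribute set A minus to set B.
    return np.mean([_cos(w, a) for a in A]) - np.mean([_cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Standard WEAT effect size between target sets X, Y and attribute sets A, B.

    Each argument is a list of embedding vectors. Larger absolute values mean a
    stronger differential association (i.e. more measured bias).
    """
    x_assoc = [_assoc(x, A, B) for x in X]
    y_assoc = [_assoc(y, A, B) for y in Y]
    pooled = np.std(x_assoc + y_assoc, ddof=1)  # sample std over all target words
    return (np.mean(x_assoc) - np.mean(y_assoc)) / pooled

# Toy usage with made-up 3-d "embeddings" (illustration only).
rng = np.random.default_rng(0)
X = [rng.normal(size=3) for _ in range(4)]   # e.g. one target word set
Y = [rng.normal(size=3) for _ in range(4)]   # e.g. the contrasting target set
A = [rng.normal(size=3) for _ in range(4)]   # attribute set A
B = [rng.normal(size=3) for _ in range(4)]   # attribute set B
print(weat_effect_size(X, Y, A, B))
```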
- Uncertainty-Driven Action Quality Assessment [67.20617610820857]
We propose a novel probabilistic model, named Uncertainty-Driven AQA (UD-AQA), to capture the diversity among multiple judge scores.
We generate an uncertainty estimate for each prediction, which is used to re-weight the AQA regression loss (see the sketch after this entry).
Our proposed method achieves competitive results on three benchmarks: the Olympic-event datasets MTL-AQA and FineDiving, and the surgical-skill dataset JIGSAWS.
arXiv Detail & Related papers (2022-07-29T07:21:15Z)
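The UD-AQA entry above mentions re-weighting the regression loss by a per-prediction uncertainty estimate but does not give the formula, so the sketch below uses a common generic scheme (a Gaussian negative-log-likelihood-style weighting). It illustrates the general idea only, not UD-AQA's exact loss, and the tensor values in the example are made up.

```python
import torch

def uncertainty_weighted_mse(pred, target, log_var):
    """Regression loss re-weighted by a predicted per-sample uncertainty.

    Generic aleatoric-uncertainty weighting (not UD-AQA's exact loss):
    samples the model is uncertain about (large log_var) contribute less to
    the squared-error term, while the + log_var term discourages the model
    from claiming high uncertainty everywhere.
    """
    precision = torch.exp(-log_var)                     # 1 / sigma^2
    per_sample = precision * (pred - target) ** 2 + log_var
    return 0.5 * per_sample.mean()

# Toy usage with made-up score predictions and predicted log-variances.
pred = torch.tensor([82.5, 74.0, 90.1])
target = torch.tensor([80.0, 75.5, 88.0])
log_var = torch.tensor([0.1, -0.3, 0.5])
print(uncertainty_weighted_mse(pred, target, log_var))
```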
- Rethinking and Refining the Distinct Metric [61.213465863627476]
We refine the calculation of distinct scores by re-scaling the number of distinct tokens based on its expectation (see the sketch after this entry).
We provide both empirical and theoretical evidence to show that our method effectively removes the biases exhibited in the original distinct score.
arXiv Detail & Related papers (2022-02-28T07:36:30Z)
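One way to implement the expectation-based re-scaling described in the Distinct-metric entry above is to divide the observed number of distinct tokens by the number expected if tokens were drawn uniformly from the vocabulary; the sketch below follows that reading. Treat the exact normaliser as an assumption rather than the paper's verbatim formula, and the vocabulary size in the example is arbitrary.

```python
def expectation_adjusted_distinct(tokens, vocab_size):
    """Distinct score re-scaled by the expected number of distinct tokens.

    `tokens` is the flat list of generated tokens and `vocab_size` the size of
    the vocabulary they are drawn from. The expectation below assumes tokens
    are sampled uniformly from the vocabulary (an illustrative assumption).
    """
    num_tokens = len(tokens)
    num_distinct = len(set(tokens))
    # Expected number of distinct tokens after num_tokens uniform draws.
    expected_distinct = vocab_size * (1 - ((vocab_size - 1) / vocab_size) ** num_tokens)
    return num_distinct / expected_distinct

# Toy usage: a short generated sequence over a hypothetical 10,000-word vocabulary.
print(expectation_adjusted_distinct("the cat sat on the mat".split(), vocab_size=10_000))
```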
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation [3.624563211765782]
We show that the best choice of evaluation method can vary from one aspect to another.
We show that the total number of annotators can have a strong impact on study power.
Current statistical analysis methods can inflate type I error rates up to eight-fold.
arXiv Detail & Related papers (2021-01-27T10:14:15Z)
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning [66.30909748400023]
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new metric which covers both linguistic qualities and semantic informativeness based on BERT.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
arXiv Detail & Related papers (2020-10-05T05:04:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.