Missing Information, Unresponsive Authors, Experimental Flaws: The
Impossibility of Assessing the Reproducibility of Previous Human Evaluations
in NLP
- URL: http://arxiv.org/abs/2305.01633v2
- Date: Mon, 7 Aug 2023 09:54:55 GMT
- Title: Missing Information, Unresponsive Authors, Experimental Flaws: The
Impossibility of Assessing the Reproducibility of Previous Human Evaluations
in NLP
- Authors: Anya Belz, Craig Thomson, Ehud Reiter, Gavin Abercrombie, Jose M.
Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Mark Cieliebak, Elizabeth
Clark, Kees van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger,
Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier
González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D.
Kelleher, Filip Klubicka, Emiel Krahmer, Huiyuan Lai, Chris van der Lee, Yiru
Li, Saad Mahamood, Margot Mieskes, Emiel van Miltenburg, Pablo Mosteiro,
Malvina Nissim, Natalie Parde, Ondřej Plátek, Verena Rieser, Jie Ruan,
Joel Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, Diyi
Yang
- Abstract summary: Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
- Score: 84.08476873280644
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We report our efforts in identifying a set of previous human evaluations in
NLP that would be suitable for a coordinated study examining what makes human
evaluations in NLP more/less reproducible. We present our results and findings,
which include that just 13% of papers had (i) sufficiently low barriers to
reproduction, and (ii) enough obtainable information, to be considered for
reproduction, and that all but one of the experiments we selected for
reproduction was discovered to have flaws that made the meaningfulness of
conducting a reproduction questionable. As a result, we had to change our
coordinated study design from a reproduce approach to a
standardise-then-reproduce-twice approach. Our overall (negative) finding that
the great majority of human evaluations in NLP is not repeatable and/or not
reproducible and/or too flawed to justify reproduction, paints a dire picture,
but presents an opportunity for a rethink about how to design and report human
evaluations in NLP.
Related papers
- ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations [16.591822946975547]
This paper reports a reproduction of the human evaluation from a prior NLP study on generating fact-checking explanations.
The results lend support to the original findings, with similar patterns seen between the original work and our reproduction.
arXiv Detail & Related papers (2024-04-26T15:31:25Z) - Human Feedback is not Gold Standard [28.63384327791185]
We critically analyse the use of human feedback for both training and evaluation.
We find that while preference scores have fairly good coverage, they under-represent important aspects like factuality.
arXiv Detail & Related papers (2023-09-28T11:18:20Z) - With a Little Help from the Authors: Reproducing Human Evaluation of an
MT Error Detector [4.636982694364995]
This work presents our efforts to reproduce the results of the human evaluation experiment presented by Vamvas and Sennrich (2022), which evaluated an automatic system detecting over- and undertranslations.
Despite the high quality of the documentation and code provided by the authors, we discuss some problems we found in reproducing the exact experimental setup and offer recommendations for improving reproducibility.
arXiv Detail & Related papers (2023-08-12T11:00:59Z) - Learning and Evaluating Human Preferences for Conversational Head
Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
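The abstract gives no details of how PS is fitted; purely as a generic illustration (not the paper's actual learned metric), the sketch below fits a simple linear predictor that maps hypothetical per-dimension scores to human preference ratings and then scores a new sample.
```python
import numpy as np

# Hypothetical per-dimension quantitative scores (rows: generated samples,
# columns: illustrative dimensions) and the human preference ratings they
# should predict. None of these numbers come from the paper.
dimension_scores = np.array([[0.8, 0.6, 0.9],
                             [0.4, 0.7, 0.5],
                             [0.9, 0.9, 0.8],
                             [0.3, 0.2, 0.4]])
human_preference = np.array([0.85, 0.55, 0.90, 0.30])

# Fit weights plus a bias term by least squares, so that a weighted
# combination of dimension scores approximates the human ratings.
X = np.hstack([dimension_scores, np.ones((len(dimension_scores), 1))])
weights, *_ = np.linalg.lstsq(X, human_preference, rcond=None)

def preference_score(scores: np.ndarray) -> float:
    """Predicted preference for a new sample's per-dimension scores."""
    return float(np.append(scores, 1.0) @ weights)

print(preference_score(np.array([0.7, 0.8, 0.6])))
```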
arXiv Detail & Related papers (2023-07-20T07:04:16Z) - FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long
Form Text Generation [176.56131810249602]
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial.
We introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source.
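Based only on the description above, a minimal sketch of the scoring step might look as follows; the fact extraction and the knowledge-source checker are assumed to exist upstream and are represented here by hypothetical inputs.
```python
from typing import Callable, List

def factscore(atomic_facts: List[str],
              is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts supported by a knowledge source.

    Both the fact extraction producing `atomic_facts` and the
    retrieval-based checker behind `is_supported` are assumed to exist
    upstream and are not shown here.
    """
    if not atomic_facts:
        return 0.0
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)

# Toy example: 3 of 4 atomic facts are supported, giving a score of 0.75.
facts = ["X was born in 1950", "X is a physicist",
         "X won prize P", "X lives in city C"]
print(factscore(facts, lambda fact: fact != "X won prize P"))
```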
arXiv Detail & Related papers (2023-05-23T17:06:00Z) - Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
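As a rough sketch of the general idea (not the paper's exact protocol), the snippet below presents the same task instructions a human evaluator would receive to a hypothetical LLM completion function and collects its responses.
```python
from typing import Callable, List

def llm_evaluate(instructions: str,
                 samples: List[str],
                 llm: Callable[[str], str]) -> List[str]:
    """Show an LLM the task instructions a human evaluator would see,
    one sample at a time, and collect its raw responses.

    `llm` is a hypothetical prompt-to-response function; parsing the
    response into a numeric rating is left to the caller.
    """
    responses = []
    for sample in samples:
        prompt = (f"{instructions}\n\n"
                  f"Text to evaluate:\n{sample}\n\n"
                  "Rate the text on a 1-5 scale and briefly explain why.")
        responses.append(llm(prompt))
    return responses
```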
arXiv Detail & Related papers (2023-05-03T07:28:50Z) - Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural
Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.
arXiv Detail & Related papers (2023-05-01T17:36:06Z) - Toward Verifiable and Reproducible Human Evaluation for Text-to-Image
Generation [35.8129864412223]
This paper proposes a standardized and well-defined human evaluation protocol.
We experimentally show that the current automatic measures are incompatible with human perception.
We provide insights for designing human evaluation experiments reliably and conclusively.
arXiv Detail & Related papers (2023-04-04T14:14:16Z) - Near-Negative Distinction: Giving a Second Life to Human Evaluation
Datasets [95.4182455942628]
We propose Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests.
In an NND test, an NLG model must place higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error.
We show that NND achieves higher correlation with human judgments than standard NLG evaluation metrics.
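A minimal sketch of the test as described above, assuming a hypothetical `log_likelihood` scoring function for the NLG model:
```python
from typing import Callable, List, Tuple

def nnd_pass_rate(tests: List[Tuple[str, str]],
                  log_likelihood: Callable[[str], float]) -> float:
    """Fraction of NND tests passed: the model assigns higher likelihood
    to the high-quality candidate than to the near-negative candidate.

    `log_likelihood` stands in for the NLG model's sequence scoring
    function and is hypothetical here.
    """
    if not tests:
        return 0.0
    passed = sum(1 for good, near_negative in tests
                 if log_likelihood(good) > log_likelihood(near_negative))
    return passed / len(tests)

# Toy example with a word-count "model" that prefers longer outputs.
tests = [("a detailed correct summary", "a wrong one"),
         ("another good candidate", "bad")]
print(nnd_pass_rate(tests, lambda text: float(len(text.split()))))
```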
arXiv Detail & Related papers (2022-05-13T20:02:53Z) - Quantified Reproducibility Assessment of NLP Results [5.181381829976355]
This paper describes and tests a method for carrying out quantified assessment (QRA) based on concepts and definitions from metrology.
We test QRA on 18 system and evaluation measure combinations, for each of which we have the original results and one to seven reproduction results.
The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but of different original studies.
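The abstract does not spell out the scoring formula; as a hedged sketch, a QRA-style degree-of-reproducibility score can be illustrated with a coefficient of variation over the original and reproduction results, where the small-sample correction used below is an assumption rather than the paper's exact definition.
```python
import statistics
from typing import Sequence

def degree_of_reproducibility(scores: Sequence[float]) -> float:
    """Coefficient of variation (in percent) over an original result and
    its reproductions of the same measurement; lower means more reproducible.

    The small-sample correction factor (1 + 1/(4n)) is a common choice and
    an assumption here, not necessarily the exact QRA formula.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need the original result and at least one reproduction")
    mean = statistics.mean(scores)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    cv = 100.0 * statistics.stdev(scores) / abs(mean)
    return cv * (1.0 + 1.0 / (4.0 * n))

# Toy example: an original BLEU score and two reproductions.
print(degree_of_reproducibility([27.1, 26.4, 27.9]))
```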
arXiv Detail & Related papers (2022-04-12T17:22:46Z)