With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector
- URL: http://arxiv.org/abs/2308.06527v1
- Date: Sat, 12 Aug 2023 11:00:59 GMT
- Title: With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector
- Authors: Ondřej Plátek and Mateusz Lango and Ondřej Dušek
- Abstract summary: This work presents our efforts to reproduce the results of the human evaluation experiment presented in the paper of Vamvas and Sennrich (2022), which evaluated an automatic system detecting over- and undertranslations.
Despite the high quality of the documentation and code provided by the authors, we discuss some problems we found in reproducing the exact experimental setup and offer recommendations for improving reproducibility.
- Score: 4.636982694364995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents our efforts to reproduce the results of the human
evaluation experiment presented in the paper of Vamvas and Sennrich (2022),
which evaluated an automatic system detecting over- and undertranslations
(translations containing more or less information than the original) in machine
translation (MT) outputs. Despite the high quality of the documentation and
code provided by the authors, we discuss some problems we found in reproducing
the exact experimental setup and offer recommendations for improving
reproducibility. Our replicated results generally confirm the conclusions of
the original study, but in some cases, statistically significant differences
were observed, suggesting a high variability of human annotation.
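The abstract reports that some differences between the original and replicated results reached statistical significance. As a purely illustrative sketch (not the authors' analysis code), the snippet below shows one common way to test such a difference on a binary annotation outcome with a chi-squared test; the counts are hypothetical.

```python
# Minimal sketch: testing whether an original and a reproduced human evaluation
# differ significantly on a binary outcome (e.g., "error detected" vs. "not
# detected"). The counts below are hypothetical, not taken from the paper.
from scipy.stats import chi2_contingency

# Rows: original study, reproduction; columns: detected, not detected.
contingency = [
    [412, 188],  # hypothetical original-study counts
    [377, 223],  # hypothetical reproduction counts
]

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
if p_value < 0.05:
    print("Annotation outcomes differ significantly between the two runs.")
else:
    print("No significant difference between original and reproduction.")
```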
Related papers
- ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations [16.591822946975547]
This paper reproduces the human evaluation findings of a prior NLP study on generating fact-checking explanations.
The results lend support to the original findings, with similar patterns seen between the original work and our reproduction.
arXiv Detail & Related papers (2024-04-26T15:31:25Z)
- Physician Detection of Clinical Harm in Machine Translation: Quality Estimation Aids in Reliance and Backtranslation Identifies Critical Errors [27.13497855061732]
This paper evaluates quality estimation feedback in vivo with a human study simulating decision-making in high-stakes medical settings.
We find that quality estimation improves appropriate reliance on MT, but backtranslation helps physicians detect more clinically harmful errors that QE alone often misses.
arXiv Detail & Related papers (2023-10-25T18:44:14Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models [57.80514758695275]
Using large language models (LLMs) for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level.
We propose a new prompting method called Error Analysis Prompting (EAPrompt).
This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM), and produces explainable and reliable MT evaluations at both the system and segment level.
arXiv Detail & Related papers (2023-03-24T05:05:03Z)
- Quantified Reproducibility Assessment of NLP Results [5.181381829976355]
This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) based on concepts and definitions from metrology.
We test QRA on 18 system and evaluation measure combinations, for each of which we have the original results and one to seven reproduction results.
The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but of different original studies.
arXiv Detail & Related papers (2022-04-12T17:22:46Z)
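As an illustration of the metrology-inspired scoring described in the QRA entry above, the sketch below computes a coefficient-of-variation-style degree-of-reproducibility score over an original result and several reproductions. The small-sample correction and the example scores are assumptions for illustration, not taken from the paper's released materials.

```python
# Illustrative sketch of a metrology-style degree-of-reproducibility score,
# assuming a coefficient of variation (with a small-sample correction) over an
# original result and its reproductions. Numbers are hypothetical and the
# function is not taken from the QRA paper's code.
import math

def coefficient_of_variation(scores: list[float]) -> float:
    """Small-sample corrected CV, in percent, of a set of comparable scores."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)  # unbiased sample variance
    cv = math.sqrt(variance) / mean * 100
    return (1 + 1 / (4 * n)) * cv  # small-sample correction factor

# Hypothetical BLEU-like scores: original study followed by three reproductions.
print(coefficient_of_variation([27.4, 27.1, 26.8, 28.0]))  # lower CV => higher reproducibility
```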
- As Easy as 1, 2, 3: Behavioural Testing of NMT Systems for Numerical Translation [51.20569527047729]
Mistranslated numbers have the potential to cause serious effects, such as financial loss or medical misinformation.
We develop comprehensive assessments of the robustness of neural machine translation systems to numerical text via behavioural testing.
arXiv Detail & Related papers (2021-07-18T04:09:47Z)
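In the spirit of the behavioural testing described in the numerical-translation entry above, here is a minimal, hypothetical check that every number in a source sentence also appears in the MT output; the regex and example sentences are illustrative only, not the paper's test suite.

```python
# Minimal behavioural check (illustrative): verify that every number in the
# source sentence also appears verbatim in the MT output.
import re

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")

def numbers_preserved(source: str, translation: str) -> bool:
    """Return True if all numbers found in the source occur in the translation."""
    src_numbers = NUMBER_RE.findall(source)
    tgt_numbers = NUMBER_RE.findall(translation)
    return all(num in tgt_numbers for num in src_numbers)

print(numbers_preserved("The dose is 2.5 mg per day.", "Die Dosis beträgt 2,5 mg pro Tag."))  # False: decimal formatting differs
print(numbers_preserved("Revenue grew by 12 percent in 2020.", "Der Umsatz stieg 2020 um 12 Prozent."))  # True
```

A realistic test suite would also normalize locale-specific number formats before comparing, which this sketch deliberately omits.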
- Reproducibility Companion Paper: Knowledge Enhanced Neural Fashion Trend Forecasting [78.046352507802]
We provide an artifact that allows the replication of the experiments using a Python implementation.
We reproduce the experiments conducted in the original paper and obtain similar performance as previously reported.
arXiv Detail & Related papers (2021-05-25T10:53:11Z)
- Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction [61.48964753725744]
We build manually-annotated test sets for two DS-RE datasets, NYT10 and Wiki20, and thoroughly evaluate several competitive models.
Results show that the manual evaluation can indicate very different conclusions from automatic ones.
arXiv Detail & Related papers (2021-05-20T06:55:40Z)
- Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation [19.116396693370422]
We propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics framework.
We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs.
We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers.
arXiv Detail & Related papers (2021-04-29T16:42:09Z)
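To illustrate the MQM-style scoring referenced in the entry above, the sketch below sums severity weights over annotated error spans for one segment. The categories and weights follow a common MQM convention (minor vs. major errors), but the exact values are assumptions here, not the scheme used in the study.

```python
# Illustrative MQM-style segment scoring. Severity weights are assumed for
# illustration and are not the exact weights from the summarized study.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}

def mqm_segment_score(errors: list[tuple[str, str]]) -> float:
    """Sum severity weights over annotated (category, severity) error spans;
    lower is better, 0.0 means no annotated errors."""
    return sum(SEVERITY_WEIGHTS[severity] for _category, severity in errors)

# Hypothetical annotations for one translated segment.
segment_errors = [("accuracy/mistranslation", "major"), ("fluency/grammar", "minor")]
print(mqm_segment_score(segment_errors))  # 6.0
```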
- A Set of Recommendations for Assessing Human-Machine Parity in Language Translation [87.72302201375847]
We reassess Hassan et al.'s investigation into Chinese to English news translation.
We show that the professional human translations contained significantly fewer errors.
arXiv Detail & Related papers (2020-04-03T17:49:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.