ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations
- URL: http://arxiv.org/abs/2404.17481v2
- Date: Tue, 14 May 2024 17:36:13 GMT
- Title: ReproHum #0087-01: Human Evaluation Reproduction Report for Generating Fact Checking Explanations
- Authors: Tyler Loakman, Chenghua Lin
- Abstract summary: This paper reproduces a human evaluation from prior NLP research as part of the ReproNLP shared task.
The results lend support to the original findings, with similar patterns seen between the original work and our reproduction.
- Score: 16.591822946975547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a partial reproduction of Generating Fact Checking Explanations by Atanasova et al. (2020) as part of the ReproHum element of the ReproNLP shared task to reproduce the findings of NLP research regarding human evaluation. This shared task aims to investigate the extent to which NLP as a field is becoming more or less reproducible over time. Following the instructions provided by the task organisers and the original authors, we collect relative rankings of 3 fact-checking explanations (comprising a gold standard and the outputs of 2 models) for 40 inputs on the criterion of Coverage. The results of our reproduction and reanalysis of the original work's raw results lend support to the original findings, with similar patterns seen between the original work and our reproduction. Whilst we observe slight variation from the original results, our findings support the main conclusions drawn by the original authors pertaining to the efficacy of their proposed models.
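The evaluation described above collects relative rankings of three explanations (a gold standard plus two model outputs) for each of 40 inputs; rankings of this kind are commonly aggregated into a mean rank per system, with lower being better. The following is a minimal illustrative sketch under that assumption, using hypothetical data and variable names; it is not the authors' analysis code.

```python
# Aggregate relative-ranking judgements into a mean rank per system.
# Assumed layout: for each input, the three explanations are ranked
# from 1 (best) to 3 (worst). Data and names below are hypothetical.
from statistics import mean

rankings = [
    # (gold_rank, model_a_rank, model_b_rank) for each evaluated input
    (1, 2, 3),
    (1, 3, 2),
    (2, 1, 3),
    # ... one tuple per input (40 in the study described above)
]

systems = ["gold", "model_a", "model_b"]
mean_rank = {name: mean(r[i] for r in rankings) for i, name in enumerate(systems)}

for name, score in sorted(mean_rank.items(), key=lambda kv: kv[1]):
    print(f"{name}: mean rank = {score:.2f}")  # lower rank = preferred more often
```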
Related papers
- Evaluating Rank-N-Contrast: Continuous and Robust Representations for Regression [0.0]
This document is a replication of the original "Rank-N-Contrast" (arXiv:2210.01189v2) paper published in 2023.
arXiv Detail & Related papers (2024-11-25T11:31:53Z) - With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector [4.636982694364995]
This work presents our efforts to reproduce the results of the human evaluation experiment in the paper by Vamvas and Sennrich (2022), which evaluated an automatic system for detecting over- and undertranslations.
Despite the high quality of the documentation and code provided by the authors, we discuss some problems we found in reproducing the exact experimental setup and offer recommendations for improvement.
arXiv Detail & Related papers (2023-08-12T11:00:59Z) - Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction and (ii) enough obtainable information to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z) - Analyzing and Evaluating Faithfulness in Dialogue Summarization [67.07947198421421]
We first perform a fine-grained human analysis of the faithfulness of dialogue summaries and observe that over 35% of generated summaries are not faithful to the source dialogues.
We present a new model-level faithfulness evaluation method that examines generation models with multiple-choice questions created by rule-based transformations.
arXiv Detail & Related papers (2022-10-21T07:22:43Z) - Quantified Reproducibility Assessment of NLP Results [5.181381829976355]
This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) based on concepts and definitions from metrology.
We test QRA on 18 system and evaluation measure combinations, for each of which we have the original results and one to seven reproduction results.
The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions not only of the same, but of different original studies.
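Degree-of-reproducibility scores of this kind are precision measures computed over an original result and its reproductions; one common formulation is a small-sample-corrected coefficient of variation. The sketch below assumes that formulation purely for illustration and is not the paper's exact QRA procedure.

```python
# Illustrative degree-of-reproducibility score: coefficient of variation
# (as a percentage) over an original score and its reproductions.
# Lower values indicate closer agreement between the results.
from math import sqrt

def coefficient_of_variation(scores: list[float]) -> float:
    n = len(scores)
    m = sum(scores) / n
    sd = sqrt(sum((x - m) ** 2 for x in scores) / (n - 1))  # sample std dev
    return (1 + 1 / (4 * n)) * (sd / m) * 100  # small-sample correction factor

# Hypothetical example: one original result and three reproduction results.
print(round(coefficient_of_variation([71.2, 69.8, 70.5, 72.0]), 2))
```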
arXiv Detail & Related papers (2022-04-12T17:22:46Z) - An Evaluation Study of Generative Adversarial Networks for Collaborative Filtering [75.83628561622287]
This work successfully replicates the results published in the original paper and discusses the impact of certain differences between the CFGAN framework and the model used in the original evaluation.
The work further expands the experimental analysis comparing CFGAN against a selection of simple and well-known properly optimized baselines, observing that CFGAN is not consistently competitive against them despite its high computational cost.
arXiv Detail & Related papers (2022-01-05T20:53:27Z) - Evidentiality-guided Generation for Knowledge-Intensive NLP Tasks [59.761411682238645]
Retrieval-augmented generation models have shown state-of-the-art performance across many knowledge-intensive NLP tasks.
We introduce a method to incorporate evidentiality of passages -- whether a passage contains correct evidence to support the output -- into training the generator.
arXiv Detail & Related papers (2021-12-16T08:18:47Z) - Reproducibility Companion Paper: Knowledge Enhanced Neural Fashion Trend Forecasting [78.046352507802]
We provide an artifact that allows the replication of the experiments using a Python implementation.
We reproduce the experiments conducted in the original paper and obtain similar performance as previously reported.
arXiv Detail & Related papers (2021-05-25T10:53:11Z) - What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z) - Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization [22.611879349101596]
We evaluate modern neural models for abstractive summarization of relevant article abstracts from systematic reviews.
We find that modern summarization systems yield consistently fluent and relevant synopses, but that they are not always factual.
arXiv Detail & Related papers (2020-08-25T22:22:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.