QRA++: Quantified Reproducibility Assessment for Common Types of Results in Natural Language Processing
- URL: http://arxiv.org/abs/2505.17043v1
- Date: Tue, 13 May 2025 13:04:04 GMT
- Title: QRA++: Quantified Reproducibility Assessment for Common Types of Results in Natural Language Processing
- Authors: Anya Belz
- Abstract summary: We present QRA++, a quantitative approach to reproducibility assessment that produces continuous-valued degree of reproducibility assessments at three levels of granularity. We illustrate this by applying QRA++ to three example sets of comparable experiments.
- Score: 6.653947064461629
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reproduction studies reported in NLP provide individual data points which in combination indicate worryingly low levels of reproducibility in the field. Because each reproduction study reports quantitative conclusions based on its own, often not explicitly stated, criteria for reproduction success/failure, the conclusions drawn are hard to interpret, compare, and learn from. In this paper, we present QRA++, a quantitative approach to reproducibility assessment that (i) produces continuous-valued degree of reproducibility assessments at three levels of granularity; (ii) utilises reproducibility measures that are directly comparable across different studies; and (iii) grounds expectations about degree of reproducibility in degree of similarity between experiments. QRA++ enables more informative reproducibility assessments to be conducted, and conclusions to be drawn about what causes reproducibility to be better/poorer. We illustrate this by applying QRA++ to three example sets of comparable experiments, revealing clear evidence that degree of reproducibility depends on similarity of experiment properties, but also system type and evaluation method.
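As a rough illustration of what continuous-valued reproducibility measures can look like at two of these levels, the sketch below computes a small-sample-corrected coefficient of variation over an original score and its reproductions (score level) and a Spearman rank correlation between per-system scores (ranking level). The function names, the example numbers, and the use of SciPy's spearmanr are illustrative assumptions, not the paper's reference implementation.

```python
"""Illustrative sketch of score-level and ranking-level reproducibility
measures in the spirit of QRA++ (assumptions noted in comments)."""

from statistics import mean, stdev

from scipy.stats import spearmanr  # assumes SciPy is installed


def cv_star(scores):
    """Small-sample-corrected coefficient of variation (in %) over an
    original score and its reproductions.

    Lower values indicate a higher degree of reproducibility. The exact
    standard-deviation convention and correction used in QRA++ may differ;
    this is only a sketch.
    """
    n = len(scores)
    m = mean(scores)
    if m == 0:
        raise ValueError("CV is undefined when the mean score is zero")
    cv = 100 * stdev(scores) / abs(m)
    return (1 + 1 / (4 * n)) * cv  # standard small-sample correction


def ranking_agreement(original_scores, reproduction_scores):
    """System-ranking-level agreement as Spearman's rho between per-system
    scores from the original study and from one reproduction."""
    rho, _ = spearmanr(original_scores, reproduction_scores)
    return rho


if __name__ == "__main__":
    # Hypothetical metric scores: one original result plus three reproductions.
    print(f"CV*: {cv_star([27.1, 26.8, 27.5, 25.9]):.2f}")
    # Hypothetical scores for four systems, original study vs. one reproduction.
    print(f"Spearman rho: {ranking_agreement([27.1, 31.4, 24.0, 29.2], [26.8, 30.9, 25.1, 28.7]):.2f}")
```

Spearman's rho is used here purely as an example of a measure that is directly comparable across studies; QRA++ itself may specify different measures at each level of granularity.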
Related papers
- Automatic Classification of User Requirements from Online Feedback -- A Replication Study [0.0]
We replicate a previous NLP4RE study (baseline), which evaluated different deep learning models for requirement classification from user reviews.
We reproduced the original results using publicly released source code, thereby helping to strengthen the external validity of the baseline study.
Our findings revealed that the baseline deep learning models, BERT and ELMo, exhibited good capabilities on an external dataset, and that GPT-4o showed performance comparable to traditional baseline machine learning models.
arXiv Detail & Related papers (2025-07-29T06:52:27Z) - MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback [128.2992631982687]
We introduce the task of experiment-guided ranking, which aims to prioritize candidate hypotheses based on the results of previously tested ones.
We propose a simulator grounded in three domain-informed assumptions, modeling hypothesis performance as a function of similarity to a known ground truth hypothesis.
We curate a dataset of 124 chemistry hypotheses with experimentally reported outcomes to validate the simulator.
arXiv Detail & Related papers (2025-05-23T13:24:50Z) - Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks [59.47851630504264]
Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data.
We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods.
The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization.
arXiv Detail & Related papers (2025-02-07T10:01:32Z) - Challenges and Considerations in the Evaluation of Bayesian Causal Discovery [49.0053848090947]
Representing uncertainty in causal discovery is a crucial component for experimental design, and more broadly, for safe and reliable causal decision making.
Unlike non-Bayesian causal discovery, which is assessed via a single estimated causal graph and its model parameters, Bayesian causal discovery presents challenges due to the nature of the quantity it infers.
There is no consensus on the most suitable metric for its evaluation.
arXiv Detail & Related papers (2024-06-05T12:45:23Z) - Can citations tell us about a paper's reproducibility? A case study of machine learning papers [3.5120846057971065]
Resource constraints and inadequate documentation can make running replications particularly challenging.
We introduce a sentiment analysis framework applied to citation contexts from papers involved in Machine Learning Reproducibility Challenges.
arXiv Detail & Related papers (2024-05-07T03:29:11Z) - Gender Biases in Automatic Evaluation Metrics for Image Captioning [87.15170977240643]
We conduct a systematic study of gender biases in model-based evaluation metrics for image captioning tasks.
We demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations.
We present a simple and effective way to mitigate the metric bias without hurting the correlations with human judgments.
arXiv Detail & Related papers (2023-05-24T04:27:40Z) - Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z) - Causal Inference via Nonlinear Variable Decorrelation for Healthcare Applications [60.26261850082012]
We introduce a novel method with a variable decorrelation regularizer to handle both linear and nonlinear confounding.
We employ association rules as new representations using association rule mining based on the original features to increase model interpretability.
arXiv Detail & Related papers (2022-09-29T17:44:14Z) - A reproducible experimental survey on biomedical sentence similarity: a string-based method sets the state of the art [0.0]
This report introduces the largest and, for the first time, reproducible experimental survey on biomedical sentence similarity.
Our aim is to elucidate the state of the art on the problem and to resolve some issues that prevent the evaluation of most current methods.
Our experiments confirm that the pre-processing stages, and the choice of the NER tool, have a significant impact on the performance of the sentence similarity methods.
arXiv Detail & Related papers (2022-05-18T06:20:42Z) - Quantified Reproducibility Assessment of NLP Results [5.181381829976355]
This paper describes and tests a method for carrying out quantified reproducibility assessment (QRA) based on concepts and definitions from metrology.
We test QRA on 18 system and evaluation measure combinations, for each of which we have the original results and one to seven reproduction results.
The proposed QRA method produces degree-of-reproducibility scores that are comparable across multiple reproductions, not only of the same original study but also of different original studies.
arXiv Detail & Related papers (2022-04-12T17:22:46Z) - Learning from Aggregate Observations [82.44304647051243]
We study the problem of learning from aggregate observations where supervision signals are given to sets of instances.
We present a general probabilistic framework that accommodates a variety of aggregate observations.
Simple maximum likelihood solutions can be applied to various differentiable models.
arXiv Detail & Related papers (2020-04-14T06:18:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.