Automated Reproducibility Has a Problem Statement Problem
- URL: http://arxiv.org/abs/2601.04226v1
- Date: Tue, 30 Dec 2025 15:56:49 GMT
- Title: Automated Reproducibility Has a Problem Statement Problem
- Authors: Thijs Snelleman, Peter Lundestad Lawrence, Holger H. Hoos, Odd Erik Gundersen
- Abstract summary: Reproducibility is essential to the scientific method, but reproduction is often a laborious task. Recent works have attempted to automate this process and relieve researchers of this workload. We hypothesise that we can represent any empirical study using a structure based on the scientific method.
- Score: 9.222158486723012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background. Reproducibility is essential to the scientific method, but reproduction is often a laborious task. Recent works have attempted to automate this process and relieve researchers of this workload. However, due to varying definitions of reproducibility, a clear problem statement is missing. Objectives. Create a generalisable problem statement, applicable to any empirical study. We hypothesise that we can represent any empirical study using a structure based on the scientific method and that this representation can be automatically extracted from any publication, and captures the essence of the study. Methods. We apply our definition of reproducibility as a problem statement for the automatisation of reproducibility by automatically extracting the hypotheses, experiments and interpretations of 20 studies and assess the quality based on assessments by the original authors of each study. Results. We create a dataset representing the reproducibility problem, consisting of the representation of 20 studies. The majority of author feedback is positive, for all parts of the representation. In a few cases, our method failed to capture all elements of the study. We also find room for improvement at capturing specific details, such as results of experiments. Conclusions. We conclude that our formulation of the problem is able to capture the concept of reproducibility in empirical AI studies across a wide range of subfields. Authors of original publications generally agree that the produced structure is representative of their work; we believe improvements can be achieved by applying our findings to create a more structured and fine-grained output in future work.
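As a rough illustration of the structure the abstract describes, the sketch below encodes a study as hypotheses, each with experiments and an interpretation. This is a minimal sketch only; every class and field name is an assumption for illustration, not the authors' actual schema.

```python
# Hypothetical sketch of a scientific-method-based study representation.
# All names and fields are illustrative assumptions; the paper's actual
# schema is not specified in this summary and may differ.
from dataclasses import dataclass, field

@dataclass
class Experiment:
    description: str  # setup, data and procedure, as reported in the paper
    results: str      # observed outcomes, as reported in the paper

@dataclass
class Hypothesis:
    statement: str                                # testable claim under study
    experiments: list[Experiment] = field(default_factory=list)
    interpretation: str = ""                      # the authors' reading of the results

@dataclass
class StudyRepresentation:
    title: str
    hypotheses: list[Hypothesis] = field(default_factory=list)

# Toy example of a representation extracted from a publication.
study = StudyRepresentation(
    title="Example empirical AI study",
    hypotheses=[Hypothesis(
        statement="Method A outperforms baseline B on task T.",
        experiments=[Experiment(
            description="Run A and B on benchmark T with five seeds.",
            results="A improves mean accuracy by 3 points over B.",
        )],
        interpretation="The results support the hypothesis on task T.",
    )],
)
```

Author feedback, as in the paper's Methods, would then amount to checking each extracted field of such a structure against the original study.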
Related papers
- The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research [56.80927148740585]
We address the challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent. Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.
arXiv Detail & Related papers (2026-02-05T19:00:02Z)
- AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage [62.049868205196425]
AutoReproduce is a framework capable of automatically reproducing experiments described in research papers in an end-to-end manner. Results show that AutoReproduce achieves an average performance gap of 22.1% on 89.74% of the executable experiment runs.
arXiv Detail & Related papers (2025-05-27T03:15:21Z)
- Smoke and Mirrors in Causal Downstream Tasks [59.90654397037007]
This paper looks at the causal inference task of treatment effect estimation, where the outcome of interest is recorded in high-dimensional observations. We compare 6,480 models fine-tuned from state-of-the-art visual backbones, and find that the sampling and modeling choices significantly affect the accuracy of the causal estimate. Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones.
arXiv Detail & Related papers (2024-05-27T13:26:34Z)
- Groundedness in Retrieval-augmented Long-form Generation: An Empirical Study [61.74571814707054]
We evaluate whether every generated sentence is grounded in retrieved documents or the model's pre-training data.
Across 3 datasets and 4 model families, our findings reveal that a significant fraction of generated sentences are consistently ungrounded.
Our results show that while larger models tend to ground their outputs more effectively, a significant portion of correct answers remains compromised by hallucinations.
arXiv Detail & Related papers (2024-04-10T14:50:10Z)
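The groundedness study above checks whether each generated sentence is supported by retrieved documents. Below is a minimal sketch of one plausible way to implement such a sentence-level check, assuming an off-the-shelf NLI model and an arbitrary 0.9 entailment threshold; the paper's actual evaluation protocol may differ.

```python
# Hypothetical sentence-level groundedness check: a generated sentence
# counts as grounded if at least one retrieved passage entails it.
# The model choice and threshold are assumptions for illustration.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def is_grounded(sentence: str, passages: list[str], threshold: float = 0.9) -> bool:
    for passage in passages:
        # Premise: retrieved passage; hypothesis: generated sentence.
        result = nli({"text": passage, "text_pair": sentence})[0]
        if result["label"] == "ENTAILMENT" and result["score"] >= threshold:
            return True
    return False
```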
- Time to Stop and Think: What kind of research do we want to do? [1.74048653626208]
In this paper, we focus on the field of metaheuristic optimization, since it is our main field of work.
Our main goal is to sow the seed of sincere critical assessment of our work, sparking a reflection process both at the individual and the community level.
All the statements included in this document are personal views and opinions, which may or may not be shared by others.
arXiv Detail & Related papers (2024-02-13T08:53:57Z)
- In-class Data Analysis Replications: Teaching Students while Testing Science [16.951059542542843]
In the present study, we incorporated data analysis replications in the project component of the Applied Data Analysis course taught at EPFL.
We find discrepancies between what students expect of data analysis replications and what they experience.
We identify tangible benefits of the in-class data analysis replications for scientific communities.
arXiv Detail & Related papers (2023-08-31T06:53:22Z)
- Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP [84.08476873280644]
Just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction.
As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach.
arXiv Detail & Related papers (2023-05-02T17:46:12Z)
- Reproducibility in machine learning for medical imaging [3.1390096961027076]
This chapter is intended as an introduction to reproducibility for researchers in the field of machine learning for medical imaging.
For each of them, we aim to define it, describe the requirements to achieve it, and discuss its utility.
The chapter ends with a discussion of the benefits of reproducibility and with a plea for a non-dogmatic approach to this concept and its implementation in research practice.
arXiv Detail & Related papers (2022-09-12T09:00:04Z)
- Sources of Irreproducibility in Machine Learning: A Review [3.905855359082687]
There exists no theoretical framework that relates experiment design choices to potential effects on the conclusions.
The objective of this paper is to develop a framework that enables applied data science practitioners and researchers to understand which experiment design choices can lead to false findings.
arXiv Detail & Related papers (2022-04-15T18:26:03Z)
- The Fundamental Principles of Reproducibility [2.4671396651514983]
I take a fundamental view on reproducibility, rooted in the scientific method.
The scientific method is analysed and characterised in order to develop the terminology required to define reproducibility.
arXiv Detail & Related papers (2020-11-19T20:37:58Z)
- Generating Fact Checking Explanations [52.879658637466605]
A crucial piece of the puzzle that is still missing is to understand how to automate the most elaborate part of the process.
This paper provides the first study of how these explanations can be generated automatically based on available claim context.
Our results indicate that optimising both objectives at the same time, rather than training them separately, improves the performance of a fact checking system.
arXiv Detail & Related papers (2020-04-13T05:23:25Z)
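The fact-checking entry above reports that optimising both objectives jointly, rather than training them separately, works better. Below is a minimal PyTorch sketch of what joint multi-task training looks like in general; the toy encoder, the two heads and the 0.5 loss weight are illustrative assumptions, not the referenced paper's architecture.

```python
# Generic joint training over a veracity-classification objective and an
# explanation-generation objective with a shared encoder. Illustrative only.
import torch
import torch.nn as nn

class JointFactChecker(nn.Module):
    def __init__(self, hidden=256, vocab=1000, num_labels=3):
        super().__init__()
        self.encoder = nn.GRU(input_size=hidden, hidden_size=hidden, batch_first=True)
        self.veracity_head = nn.Linear(hidden, num_labels)  # claim verdict logits
        self.explain_head = nn.Linear(hidden, vocab)        # explanation token logits

    def forward(self, x):
        states, _ = self.encoder(x)                         # (batch, seq, hidden)
        return self.veracity_head(states[:, -1]), self.explain_head(states)

model = JointFactChecker()
criterion = nn.CrossEntropyLoss()

x = torch.randn(4, 12, 256)                # toy encoded claim contexts
labels = torch.randint(0, 3, (4,))         # toy veracity labels
targets = torch.randint(0, 1000, (4, 12))  # toy explanation token ids

v_logits, e_logits = model(x)
# One backward pass over a weighted sum of both losses, i.e. the two
# objectives are optimised at the same time rather than separately.
loss = criterion(v_logits, labels) + 0.5 * criterion(e_logits.transpose(1, 2), targets)
loss.backward()
```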