Assessing Reproducibility in Evolutionary Computation: A Case Study using Human- and LLM-based Assessment
- URL: http://arxiv.org/abs/2602.07059v1
- Date: Thu, 05 Feb 2026 08:32:29 GMT
- Title: Assessing Reproducibility in Evolutionary Computation: A Case Study using Human- and LLM-based Assessment
- Authors: Francesca Da Ros, Tarik Začiragić, Aske Plaat, Thomas Bäck, Niki van Stein
- Abstract summary: We study reproducibility practices in papers published in the Evolutionary Combinatorial Optimization and Metaheuristics track of the Genetic and Evolutionary Computation Conference over a ten-year period. We introduce a structured reproducibility checklist and apply it through a systematic manual assessment of the selected corpus. In addition, we propose RECAP (REproducibility Checklist Automation Pipeline), an LLM-based system that automatically evaluates reproducibility signals from paper text and associated code repositories.
- Score: 2.0365636651755263
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reproducibility is an important requirement in evolutionary computation, where results largely depend on computational experiments. In practice, reproducibility relies on how algorithms, experimental protocols, and artifacts are documented and shared. Despite growing awareness, there is still limited empirical evidence on the actual reproducibility levels of published work in the field. In this paper, we study the reproducibility practices in papers published in the Evolutionary Combinatorial Optimization and Metaheuristics track of the Genetic and Evolutionary Computation Conference over a ten-year period. We introduce a structured reproducibility checklist and apply it through a systematic manual assessment of the selected corpus. In addition, we propose RECAP (REproducibility Checklist Automation Pipeline), an LLM-based system that automatically evaluates reproducibility signals from paper text and associated code repositories. Our analysis shows that papers achieve an average completeness score of 0.62, and that 36.90% of them provide additional material beyond the manuscript itself. We demonstrate that automated assessment is feasible: RECAP achieves substantial agreement with human evaluators (Cohen's κ of 0.67). Together, these results highlight persistent gaps in reproducibility reporting and suggest that automated tools can effectively support large-scale, systematic monitoring of reproducibility practices.
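To make the two headline quantities concrete, the sketch below (not the authors' RECAP implementation) shows how they are typically computed: a per-paper completeness score is the fraction of checklist items satisfied, and Cohen's κ corrects the raw human-vs-LLM agreement for agreement expected by chance. The checklist item names and the example labels are hypothetical placeholders.

```python
# Minimal sketch of checklist completeness scoring and Cohen's kappa.
# Not the RECAP code; item names and labels below are hypothetical.
from collections import Counter
from typing import Dict, List

CHECKLIST = ["algorithm_described", "parameters_reported",
             "code_available", "data_available", "statistics_reported"]

def completeness_score(paper: Dict[str, bool]) -> float:
    """Fraction of checklist items a single paper satisfies."""
    return sum(paper.get(item, False) for item in CHECKLIST) / len(CHECKLIST)

def cohens_kappa(human: List[int], llm: List[int]) -> float:
    """Chance-corrected agreement between two binary raters."""
    n = len(human)
    p_o = sum(h == a for h, a in zip(human, llm)) / n            # observed agreement
    h_counts, a_counts = Counter(human), Counter(llm)
    p_e = sum(h_counts[c] / n * a_counts[c] / n for c in (0, 1))  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-item judgments (1 = item satisfied) from the two raters.
human_labels = [1, 1, 0, 1, 0, 1, 1, 0]
llm_labels   = [1, 1, 0, 0, 0, 1, 1, 1]
print(f"kappa = {cohens_kappa(human_labels, llm_labels):.2f}")
```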
Related papers
- Large Language Models for Software Engineering: A Reproducibility Crisis [4.730658148470817]
This paper presents the first large-scale, empirical study of practices in large language model (LLM)-based software engineering research. We systematically mined and analyzed 640 papers published between 2017 and 2025 across premier software engineering, machine learning, and natural language processing venues. Our analysis reveals persistent gaps in artifact availability, environment specification, versioning rigor, and documentation clarity.
arXiv Detail & Related papers (2025-11-29T22:16:47Z) - AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research [81.04845910798387]
Generating natural language explanations for threat detections remains an open problem in cybersecurity research. We present AutoMalDesc, an automated static analysis summarization framework that operates independently at scale. We publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) datasets, along with our methodology and evaluation framework.
arXiv Detail & Related papers (2025-11-17T13:05:25Z) - NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation [58.30936615525824]
We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 employs pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings. It is trained on pairwise comparisons but enables efficient pointwise prediction at deployment.
arXiv Detail & Related papers (2025-09-29T17:59:23Z) - CRACQ: A Multi-Dimensional Approach To Automated Document Assessment [0.0]
CRACQ is a multi-dimensional evaluation framework tailored to assess documents across five specific traits: Coherence, Rigor, Appropriateness, Completeness, and Quality. It integrates linguistic, semantic, and structural signals into a cumulative assessment, enabling both holistic and trait-level analysis.
arXiv Detail & Related papers (2025-09-26T17:01:54Z) - Automatic Classification of User Requirements from Online Feedback -- A Replication Study [0.0]
We replicate a previous NLP4RE study (baseline), which evaluated different deep learning models for requirement classification from user reviews. We reproduced the original results using publicly released source code, thereby helping to strengthen the external validity of the baseline study. Our findings revealed that baseline deep learning models, BERT and ELMo, exhibited good capabilities on an external dataset, and GPT-4o showed performance comparable to traditional baseline machine learning models.
arXiv Detail & Related papers (2025-07-29T06:52:27Z) - AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage [62.049868205196425]
AutoReproduce is a framework capable of automatically reproducing experiments described in research papers in an end-to-end manner. Results show that AutoReproduce achieves an average performance gap of 22.1% on 89.74% of the executable experiment runs.
arXiv Detail & Related papers (2025-05-27T03:15:21Z) - Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning [63.531262595858]
A divide-and-conquer approach breaks the comprehensive evaluation task into localized scoring tasks, followed by a final global assessment. We introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation.
arXiv Detail & Related papers (2025-05-26T16:39:41Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries.
Experimental results show that our method improves consistently over existing methods.
Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z) - On the Effectiveness of Automated Metrics for Text Generation Systems [4.661309379738428]
We propose a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets.
The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems.
arXiv Detail & Related papers (2022-10-24T08:15:28Z) - Predicting the Reproducibility of Social and Behavioral Science Papers
Using Supervised Learning Models [21.69933721765681]
We propose a framework that extracts five types of features from scholarly work that can be used to support assessments of published research claims.
We analyze pairwise correlations between individual features and their importance for predicting a set of human-assessed ground truth labels.
arXiv Detail & Related papers (2021-04-08T00:45:20Z)