The State of Open Science in Software Engineering Research: A Case Study of ICSE Artifacts
- URL: http://arxiv.org/abs/2601.02066v1
- Date: Mon, 05 Jan 2026 12:47:43 GMT
- Title: The State of Open Science in Software Engineering Research: A Case Study of ICSE Artifacts
- Authors: Al Muttakin, Saikat Mondal, Chanchal Roy
- Abstract summary: There is a marked lack of studies that comprehensively examine the executability and reproducibility of replication packages in software engineering (SE) research. We evaluate 100 replication packages published as part of ICSE proceedings over the past decade. Our findings reveal that only 40% of the 100 artifacts evaluated were executable, of which 32.5% (13 out of 40) ran without any modification.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Replication packages are crucial for enabling transparency, validation, and reuse in software engineering (SE) research. While artifact sharing is now a standard practice and even expected at premier SE venues such as ICSE, the practical usability of these replication packages remains underexplored. In particular, there is a marked lack of studies that comprehensively examine the executability and reproducibility of replication packages in SE research. In this paper, we aim to fill this gap by evaluating 100 replication packages published as part of ICSE proceedings over the past decade (2015-2024). We assess the (1) executability of the replication packages, (2) efforts and modifications required to execute them, (3) challenges that prevent executability, and (4) reproducibility of the original findings. We spent approximately 650 person-hours in total executing the artifacts and reproducing the study findings. Our findings reveal that only 40% of the 100 evaluated artifacts were executable, of which 32.5% (13 out of 40) ran without any modification. Regarding effort levels, 17.5% (7 out of 40) required low effort, while 82.5% (33 out of 40) required moderate to high effort to execute successfully. We identified five common types of modifications and 13 challenges leading to execution failure, spanning environmental, documentation, and structural issues. Among the executable artifacts, only 35% (14 out of 40) reproduced the original results. These findings highlight a notable gap between artifact availability, executability, and reproducibility. Our study proposes three actionable guidelines to improve the preparation, documentation, and review of research artifacts, thereby strengthening the rigor and sustainability of open science practices in SE research.
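The abstract reports its headline numbers against two different denominators (all 100 artifacts for executability; the 40 executable ones for everything else), which is easy to misread. The short Python sketch below simply re-derives each reported percentage from the counts quoted in the abstract; the counts come verbatim from the paper, while the variable and function names are my own.

```python
# Re-derive the percentages reported in the abstract from its raw counts.
# All counts are taken directly from the paper's abstract; nothing here is new data.

TOTAL_ARTIFACTS = 100        # ICSE replication packages evaluated (2015-2024)
EXECUTABLE = 40              # artifacts that could be executed at all
NO_MODIFICATION = 13         # executable without any modification
LOW_EFFORT = 7               # executable with low effort
MODERATE_HIGH_EFFORT = 33    # executable with moderate to high effort
REPRODUCED = 14              # executable artifacts that reproduced the original results

def pct(part: int, whole: int) -> str:
    """Format part/whole as a percentage string."""
    return f"{100 * part / whole:.1f}%"

# Note the two denominators: executability is measured against all 100
# artifacts, while the remaining rates are measured against the 40 executable ones.
print(f"Executable:           {pct(EXECUTABLE, TOTAL_ARTIFACTS)} of {TOTAL_ARTIFACTS}")  # 40.0%
print(f"Ran unmodified:       {pct(NO_MODIFICATION, EXECUTABLE)} of {EXECUTABLE}")       # 32.5%
print(f"Low effort:           {pct(LOW_EFFORT, EXECUTABLE)} of {EXECUTABLE}")            # 17.5%
print(f"Moderate/high effort: {pct(MODERATE_HIGH_EFFORT, EXECUTABLE)} of {EXECUTABLE}")  # 82.5%
print(f"Reproduced results:   {pct(REPRODUCED, EXECUTABLE)} of {EXECUTABLE}")            # 35.0%
```

Note that the modification breakdown (13 of 40 ran unmodified) and the effort breakdown (7 low plus 33 moderate-to-high, summing to 40) appear to be two separate categorizations of the same 40 executable artifacts.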
Related papers
- The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research
We address the challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent. Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.
arXiv Detail & Related papers (2026-02-05T19:00:02Z)
- Agent-Based Software Artifact Evaluation
Artifact evaluation has been adopted in the Software Engineering (SE) research community for 15 years. We propose ArtifactCopilot, the first end-to-end agent-based framework for automated artifact evaluation.
arXiv Detail & Related papers (2026-02-02T15:41:16Z)
- An Audit of Machine Learning Experiments on Software Defect Prediction
Machine learning algorithms are widely used to predict defect-prone software components. This paper audits recent software defect prediction (SDP) studies by assessing their experimental design, analysis, and reporting practices.
arXiv Detail & Related papers (2026-01-26T13:31:32Z)
- Chasing Shadows: Pitfalls in LLM Security Research
We identify nine common pitfalls that have become relevant with the emergence of large language models (LLMs). These pitfalls span the entire process, from data collection, pre-training, and fine-tuning to prompting and evaluation. We find that every paper contains at least one pitfall, and each pitfall appears in multiple papers. Yet only 15.7% of the pitfalls present were explicitly discussed, suggesting that the majority remain unrecognized.
arXiv Detail & Related papers (2025-12-10T11:39:09Z)
- Large Language Models for Software Engineering: A Reproducibility Crisis
This paper presents the first large-scale, empirical study of reproducibility practices in large language model (LLM)-based software engineering research. We systematically mined and analyzed 640 papers published between 2017 and 2025 across premier software engineering, machine learning, and natural language processing venues. Our analysis reveals persistent gaps in artifact availability, environment specification, versioning rigor, and documentation clarity.
arXiv Detail & Related papers (2025-11-29T22:16:47Z)
- BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
BrowseComp-Plus is a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods.
arXiv Detail & Related papers (2025-08-08T17:55:11Z)
- Research Artifacts in Secondary Studies: A Systematic Mapping in Software Engineering
Systematic reviews (SRs) summarize state-of-the-art evidence in science, including software engineering (SE). We examined 537 secondary studies published between 2013 and 2023 to analyze the availability and reporting of research artifacts.
arXiv Detail & Related papers (2025-04-17T05:11:39Z)
- On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations
Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence where an agent uses a neural network to learn which actions to take in a given environment. DRL has recently gained traction from being able to solve complex environments like driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games. Numerous implementations of the state-of-the-art algorithms responsible for training these agents, like the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist.
arXiv Detail & Related papers (2025-03-28T16:25:06Z)
- O1 Replication Journey: A Strategic Progress Report -- Part 1
This paper introduces a pioneering approach to artificial intelligence research, embodied in our O1 Replication Journey.
Our methodology addresses critical challenges in modern AI research, including the insularity of prolonged team-based projects.
We propose the journey learning paradigm, which encourages models to learn not just shortcuts, but the complete exploration process.
arXiv Detail & Related papers (2024-10-08T15:13:01Z)
- SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 subproblems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z)
- ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images
The competition opened on 30 December 2022 and closed on 24 March 2023.
There were 35 participants and 91 valid submissions for Track 1, and 15 participants and 26 valid submissions for Track 2.
Based on the performance of the submissions, we believe there is still a large gap between current and expected information extraction performance in complex and zero-shot scenarios.
arXiv Detail & Related papers (2023-06-05T22:20:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.