Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies
- URL: http://arxiv.org/abs/2510.25506v1
- Date: Wed, 29 Oct 2025 13:31:32 GMT
- Title: Reflections on the Reproducibility of Commercial LLM Performance in Empirical Software Engineering Studies
- Authors: Florian Angermeir, Maximilian Amougou, Mark Kreitz, Andreas Bauer, Matthias Linhuber, Davide Fucci, Fabiola Moyón C., Daniel Mendez, Tony Gorschek
- Abstract summary: We studied 86 articles describing LLM-centric studies, published at ICSE 2024 and ASE 2024. Of the 86 articles, 18 provided research artefacts and used OpenAI models. Of the 18 studies, only five were fit for reproduction. We were unable to fully reproduce the results of any of the five.
- Score: 3.053547151063031
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models have gained remarkable interest in industry and academia. The increasing interest in LLMs in academia is also reflected in the number of publications on this topic over the last years. For instance, 78 of the roughly 425 publications at ICSE 2024 alone performed experiments with LLMs. Conducting empirical studies with LLMs remains challenging and raises questions about how to achieve reproducible results, for both other researchers and practitioners. One important step towards excelling in empirical research on LLMs and their application is to first understand to what extent current research results are reproducible and what factors may impede reproducibility. This investigation is within the scope of our work. We contribute an analysis of the reproducibility of LLM-centric studies, provide insights into the factors impeding reproducibility, and discuss suggestions on how to improve the current state. In particular, we studied the 86 articles describing LLM-centric studies published at ICSE 2024 and ASE 2024. Of the 86 articles, 18 provided research artefacts and used OpenAI models. We attempted to replicate those 18 studies. Of the 18 studies, only five were fit for reproduction. We were unable to fully reproduce the results of any of the five: two studies seemed to be partially reproducible, and three did not seem to be reproducible. Our results highlight not only the need for stricter research artefact evaluations but also for more robust study designs to ensure the reproducible value of future publications.
Related papers
- LLM-REVal: Can We Trust LLM Reviewers Yet? [70.58742663985652]
Large language models (LLMs) have inspired researchers to integrate them extensively into the academic workflow. This study focuses on how the deep integration of LLMs into both peer-review and research processes may influence scholarly fairness.
arXiv Detail & Related papers (2025-10-14T10:30:20Z)
- A Survey of AIOps in the Era of Large Language Models [60.59720351854515]
We analyzed 183 research papers published between January 2020 and December 2024 to answer four key research questions (RQs). We discuss the state-of-the-art advancements and trends, identify gaps in existing research, and propose promising directions for future exploration.
arXiv Detail & Related papers (2025-06-23T02:40:16Z)
- MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? [66.87201770167012]
MLRC-Bench is a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions. Unlike prior work, MLRC-Bench measures the key steps of proposing and implementing novel research methods. Even the best-performing tested agent closes only 9.3% of the gap between baseline and top human participant scores.
arXiv Detail & Related papers (2025-04-13T19:35:43Z)
- ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition [67.26124739345332]
Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined. We introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers.
arXiv Detail & Related papers (2025-03-27T08:09:15Z)
- Highlighting Case Studies in LLM Literature Review of Interdisciplinary System Science [0.18416014644193066]
Large Language Models (LLMs) were used to assist four Commonwealth Scientific and Industrial Research Organisation (CSIRO) researchers. We evaluate the performance of LLMs for systematic literature reviews.
arXiv Detail & Related papers (2025-03-16T05:52:18Z)
- LLM4SR: A Survey on Large Language Models for Scientific Research [15.533076347375207]
Large Language Models (LLMs) offer unprecedented support across various stages of the research cycle. This paper presents the first systematic survey dedicated to exploring how LLMs are revolutionizing the scientific research process.
arXiv Detail & Related papers (2025-01-08T06:44:02Z)
- Towards Evaluation Guidelines for Empirical Studies involving LLMs [6.174354685766166]
Large language models (LLMs) have changed the software engineering research landscape. This paper contributes the first set of holistic guidelines for such studies.
arXiv Detail & Related papers (2024-11-12T09:35:23Z)
- CycleResearcher: Improving Automated Research via Automated Review [37.03497673861402]
This paper explores the possibility of using open-source post-trained large language models (LLMs) as autonomous agents capable of performing the full cycle of automated research and review. To train these models, we develop two new datasets, reflecting real-world machine learning research and peer review dynamics. Our results demonstrate that CycleReviewer achieves promising performance with a 26.89% reduction in mean absolute error (MAE) compared to individual human reviewers in predicting paper scores.
arXiv Detail & Related papers (2024-10-28T08:10:21Z)
- Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers [90.26363107905344]
Large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery.
No evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas.
arXiv Detail & Related papers (2024-09-06T08:25:03Z)
- Using Large Language Models to Create AI Personas for Replication, Generalization and Prediction of Media Effects: An Empirical Test of 133 Published Experimental Research Findings [0.3749861135832072]
This report analyzes the potential for large language models (LLMs) to expedite accurate replication and generalization of published research about message effects in marketing. LLMs were tested by replicating 133 experimental findings from 14 papers containing 45 recent studies published in the Journal of Marketing. The LLM replications successfully reproduced 76% of the original main effects (84 out of 111), demonstrating strong potential for AI-assisted replication.
arXiv Detail & Related papers (2024-08-28T18:14:39Z)
- Awes, Laws, and Flaws From Today's LLM Research [0.0]
We assess over 2,000 research works released between 2020 and 2024 based on criteria typical of what is considered good research. We find multiple trends, such as declines in ethics disclaimers, a rise of LLMs as evaluators, and an increase in claims of LLM reasoning abilities without leveraging human evaluation.
arXiv Detail & Related papers (2024-08-27T21:19:37Z)
- Mapping the Increasing Use of LLMs in Scientific Papers [99.67983375899719]
We conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals.
Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers.
arXiv Detail & Related papers (2024-04-01T17:45:15Z)
- A Comprehensive Overview of Large Language Models [68.22178313875618]
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks.
This article provides an overview of the existing literature on a broad range of LLM-related concepts.
arXiv Detail & Related papers (2023-07-12T20:01:52Z)
- A Bibliometric Review of Large Language Models Research from 2017 to 2023 [1.4190701053683017]
Large language models (LLMs) are language models that have demonstrated outstanding performance across a range of natural language processing (NLP) tasks.
This paper serves as a roadmap for researchers, practitioners, and policymakers to navigate the current landscape of LLMs research.
arXiv Detail & Related papers (2023-04-03T21:46:41Z)