Oops!... I did it again. Conclusion (In-)Stability in Quantitative Empirical Software Engineering: A Large-Scale Analysis
- URL: http://arxiv.org/abs/2510.06844v1
- Date: Wed, 08 Oct 2025 10:11:39 GMT
- Title: Oops!... I did it again. Conclusion (In-)Stability in Quantitative Empirical Software Engineering: A Large-Scale Analysis
- Authors: Nicole Hoess, Carlos Paradis, Rick Kazman, Wolfgang Mauerer
- Abstract summary: Mining software repositories is a popular means to gain insights into a software project's evolution. This study investigates some threats to validity in complex tool pipelines for evolutionary software analyses.
- Score: 5.94721915761333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Context: Mining software repositories is a popular means to gain insights into a software project's evolution, monitor project health, support decisions and derive best practices. Tools supporting the mining process are commonly applied by researchers and practitioners, but their limitations and agreement are often not well understood. Objective: This study investigates some threats to validity in complex tool pipelines for evolutionary software analyses and evaluates the tools' agreement in terms of data, study outcomes and conclusions for the same research questions. Method: We conduct a lightweight literature review to select three studies on collaboration and coordination, software maintenance and software quality from high-ranked venues, which we formally replicate with four independent, systematically selected mining tools to quantitatively and qualitatively compare the extracted data, analysis results and conclusions. Results: We find that numerous technical details in tool design and implementation accumulate along the complex mining pipelines and can cause substantial differences in the extracted baseline data, its derivatives, subsequent results of statistical analyses and, under specific circumstances, conclusions. Conclusions: Users must carefully choose tools and evaluate their limitations to adequately assess the scope of validity. Reusing tools is recommended. Researchers and tool authors can promote reusability and help reduce uncertainties through reproduction packages and comparative studies following our approach.
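To make the comparison step concrete, the following minimal sketch illustrates how per-project baseline metrics exported by two mining tools could be joined and their relative deviation computed. This is not the authors' actual pipeline; the file and column names are hypothetical.

```python
# Sketch: compare per-project baseline metrics exported by two mining tools.
# File names and column names (project, commits, authors) are hypothetical.
import pandas as pd

def compare_baseline(path_a: str, path_b: str, key: str = "project") -> pd.DataFrame:
    a = pd.read_csv(path_a)  # assumed columns: project, commits, authors
    b = pd.read_csv(path_b)
    merged = a.merge(b, on=key, suffixes=("_a", "_b"))
    for metric in ("commits", "authors"):
        # relative deviation against the smaller of the two values
        merged[f"{metric}_rel_diff"] = (
            (merged[f"{metric}_a"] - merged[f"{metric}_b"]).abs()
            / merged[[f"{metric}_a", f"{metric}_b"]].min(axis=1)
        )
    return merged

if __name__ == "__main__":
    report = compare_baseline("tool_a_metrics.csv", "tool_b_metrics.csv")
    print(report[["project", "commits_rel_diff", "authors_rel_diff"]])
```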
Related papers
- ScholarGym: Benchmarking Deep Research Workflows on Academic Literature Retrieval [11.41528830724814]
We present ScholarGym, a simulation environment for reproducible evaluation of deep research on academic literature. Built on a static corpus of 570K papers with deterministic retrieval, ScholarGym provides 2,536 queries with expert-annotated ground truth.
arXiv Detail & Related papers (2026-01-29T12:51:44Z) - DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization). Our results demonstrate that agentic plan-and-write approaches significantly outperform single-turn generation.
arXiv Detail & Related papers (2026-01-07T03:07:52Z) - Exploring the Garden of Forking Paths in Empirical Software Engineering Research: A Multiverse Analysis [3.6324565773746147]
We conduct a so-called multiverse analysis on a published empirical SE paper. We identify nine pivotal analytical decisions with at least one equally defensible alternative. The overwhelming majority of the resulting alternative analyses produced qualitatively different, and sometimes even opposite, findings.
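As an illustration of the multiverse idea, one can enumerate all combinations of defensible analytical choices, rerun the same analysis under each, and check whether the estimated effect keeps its sign. This is a generic sketch on synthetic data; the decision sets are hypothetical and do not correspond to the nine decisions identified in the paper.

```python
# Generic multiverse-analysis sketch on synthetic data (hypothetical decisions).
from itertools import product
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 0.3 * df["x"] + rng.normal(size=200)

DECISIONS = {
    "outlier_rule": ["none", "iqr"],    # drop outliers or keep all data
    "transform": ["raw", "log1p"],      # transform the response or not
    "errors": ["nonrobust", "HC1"],     # classical vs. robust standard errors
}

def run_universe(data: pd.DataFrame, outlier_rule: str, transform: str, errors: str) -> float:
    d = data.copy()
    if outlier_rule == "iqr":
        q1, q3 = d["y"].quantile([0.25, 0.75])
        iqr = q3 - q1
        d = d[d["y"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    if transform == "log1p":
        d["y"] = np.log1p(d["y"] - d["y"].min())  # shift so log1p is defined
    fit = smf.ols("y ~ x", data=d).fit(cov_type=errors)
    return fit.params["x"]

effects = [run_universe(df, *combo) for combo in product(*DECISIONS.values())]
print(f"{sum(e > 0 for e in effects)}/{len(effects)} universes estimate a positive effect")
```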
arXiv Detail & Related papers (2025-12-09T18:47:00Z) - Analytical Survey of Learning with Low-Resource Data: From Analysis to Investigation [192.53529928861818]
Learning with high-resource data has demonstrated substantial success in artificial intelligence (AI). However, the costs associated with data annotation and model training remain significant. This survey employs active sampling theory to analyze the generalization error and label complexity associated with learning from low-resource data.
arXiv Detail & Related papers (2025-10-10T03:15:42Z) - Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning [68.89572566071575]
Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to improve their internal reasoning ability by integrating external tools. We propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately. Experimental results on 10 datasets demonstrate the effectiveness of Tool-Light.
arXiv Detail & Related papers (2025-09-27T12:53:37Z) - Does the Tool Matter? Exploring Some Causes of Threats to Validity in Mining Software Repositories [9.539825294372786]
We use two tools to extract and analyse ten large software projects. Despite similar trends, even simple metrics such as the numbers of commits and developers may differ by up to 500%. We find that such substantial differences are often caused by minor technical details.
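The following sketch, which is illustrative only and not taken from the paper's tooling, shows how even "simple" metrics depend on seemingly minor choices such as counting merge commits or normalizing author identities. It assumes `git` is on the PATH and that `repo_path` points at a local clone.

```python
# Sketch: the same repository yields different commit and developer counts
# depending on merge-commit handling and author-identity deduplication.
import subprocess

def git(repo_path: str, *args: str) -> str:
    out = subprocess.run(["git", "-C", repo_path, *args],
                         capture_output=True, text=True, check=True)
    return out.stdout

def commit_counts(repo_path: str) -> tuple[int, int]:
    with_merges = int(git(repo_path, "rev-list", "--count", "HEAD"))
    without_merges = int(git(repo_path, "rev-list", "--count", "--no-merges", "HEAD"))
    return with_merges, without_merges

def developer_counts(repo_path: str) -> tuple[int, int]:
    raw_emails = set(git(repo_path, "log", "--format=%ae").splitlines())
    mapped_names = set(git(repo_path, "log", "--format=%aN").splitlines())  # %aN honors .mailmap
    return len(raw_emails), len(mapped_names)

if __name__ == "__main__":
    repo_path = "."  # hypothetical: any local git clone
    print("commits (with/without merges):", commit_counts(repo_path))
    print("developers (raw emails vs. mailmap names):", developer_counts(repo_path))
```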
arXiv Detail & Related papers (2025-01-25T07:42:56Z) - A Computational Method for Measuring "Open Codes" in Qualitative Analysis [44.39424825305388]
This paper presents a theory-informed computational method for measuring inductive coding results from humans and Generative AI (GAI). It measures each coder's contribution against the merged result using four novel metrics: Coverage, Overlap, Novelty, and Divergence. Our work provides a reliable pathway for ensuring methodological rigor in human-AI qualitative analysis.
arXiv Detail & Related papers (2024-11-19T00:44:56Z) - Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z) - Efficacy of static analysis tools for software defect detection on open-source projects [0.0]
The study used popular analysis tools such as SonarQube, PMD, Checkstyle, and FindBugs to perform the comparison.
The study results show that SonarQube performs considerably better than all the other tools in terms of defect detection.
arXiv Detail & Related papers (2024-05-20T19:05:32Z) - Toward Unified Practices in Trajectory Prediction Research on Bird's-Eye-View Datasets [3.1406146587437904]
The availability of high-quality datasets is crucial for the development of behavior prediction algorithms in autonomous vehicles. This paper highlights the need to standardize the use of certain datasets for motion forecasting research. We propose a set of tools and practices to achieve this.
arXiv Detail & Related papers (2024-05-01T16:17:39Z) - A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [51.26815896167173]
We present a comprehensive tertiary analysis of PAMI reviews along three complementary dimensions. Our analyses reveal distinctive organizational patterns as well as persistent gaps in current review practices. Finally, our evaluation of state-of-the-art AI-generated reviews indicates encouraging advances in coherence and organization.
arXiv Detail & Related papers (2024-02-20T11:28:50Z) - Distributed intelligence on the Edge-to-Cloud Continuum: A systematic literature review [62.997667081978825]
This review aims to provide a comprehensive overview of the main state-of-the-art libraries and frameworks for machine learning and data analytics available today. It also surveys the main simulation, emulation, and deployment systems, as well as testbeds, for experimental research on the Edge-to-Cloud Continuum.
arXiv Detail & Related papers (2022-04-29T08:06:05Z) - Open Source Software for Efficient and Transparent Reviews [0.11179881480027788]
ASReview is an open source machine learning-aided pipeline applying active learning.
We demonstrate by means of simulation studies that ASReview can make systematic reviewing far more efficient than fully manual screening.
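A generic active-learning screening loop in the spirit of such pipelines might look as follows. This is an illustrative sketch using scikit-learn, not ASReview's actual API; the function names and the certainty-based query strategy are assumptions for exposition.

```python
# Sketch: a classifier trained on the labels collected so far ranks the
# unscreened records; the reviewer labels the top-ranked record next.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def screen(abstracts, oracle, seed_labels, budget=50):
    """abstracts: list[str]; oracle(i) -> 0/1 reviewer label;
    seed_labels: dict[index, label], must contain both classes."""
    X = TfidfVectorizer().fit_transform(abstracts)
    labeled = dict(seed_labels)
    for _ in range(budget):
        pool = [i for i in range(len(abstracts)) if i not in labeled]
        if not pool:
            break
        idx = sorted(labeled)
        clf = LogisticRegression(max_iter=1000).fit(X[idx], [labeled[i] for i in idx])
        probs = clf.predict_proba(X[pool])[:, 1]  # P(relevant)
        pick = pool[int(np.argmax(probs))]        # certainty-based query
        labeled[pick] = oracle(pick)
    return labeled
```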
arXiv Detail & Related papers (2020-06-22T11:57:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.