Can GPT-4 Replicate Empirical Software Engineering Research?
- URL: http://arxiv.org/abs/2310.01727v3
- Date: Wed, 19 Jun 2024 07:17:28 GMT
- Title: Can GPT-4 Replicate Empirical Software Engineering Research?
- Authors: Jenny T. Liang, Carmen Badea, Christian Bird, Robert DeLine, Denae Ford, Nicole Forsgren, Thomas Zimmermann
- Abstract summary: We study GPT-4's ability to perform replications of empirical software engineering research on new data.
We find that GPT-4 can surface correct assumptions but struggles to generate assumptions that apply common knowledge about software engineering data.
- Score: 20.89031544114989
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Empirical software engineering research on production systems has brought forth a better understanding of the software engineering process for practitioners and researchers alike. However, only a small subset of production systems is studied, limiting the impact of this research. While software engineering practitioners could benefit from replicating research on their own data, this poses its own set of challenges, since performing replications requires a deep understanding of research methodologies and subtle nuances in software engineering data. Given that large language models (LLMs), such as GPT-4, show promise in tackling both software engineering- and science-related tasks, these models could help replicate and thus democratize empirical software engineering research. In this paper, we examine GPT-4's abilities to perform replications of empirical software engineering research on new data. We study its ability to surface assumptions made in empirical software engineering research methodologies, as well as its ability to plan and generate code for analysis pipelines on seven empirical software engineering papers. We perform a user study with 14 participants with software engineering research expertise, who evaluate GPT-4-generated assumptions and analysis plans (i.e., a list of module specifications) from the papers. We find that GPT-4 is able to surface correct assumptions, but struggles to generate ones that apply common knowledge about software engineering data. In a manual analysis of the generated code, we find that the GPT-4-generated code contains correct high-level logic, given a subset of the methodology. However, the code contains many small implementation-level errors, reflecting a lack of software engineering knowledge. Our findings have implications for leveraging LLMs for software engineering research as well as for practitioner data scientists in software teams.
Related papers
- A Systematic Literature Review on the Use of Machine Learning in Software Engineering [0.0]
Following its objective and research questions, the study explores the current state of the art in applying machine learning techniques to software engineering processes.
The review identifies the key areas within software engineering where ML has been applied, including software quality assurance, software maintenance, software comprehension, and software documentation.
arXiv Detail & Related papers (2024-06-19T23:04:27Z)
- MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset for Multi-Aspect Summarization of Scientific Workflows.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z)
- Morescient GAI for Software Engineering [2.4861619769660637]
Using Generative AI (GAI) for software engineering tasks is one of the most rapidly expanding fields of software engineering research.
We present a vision for how "Morescient" GAI models can be engineered, evolved, and disseminated according to the principles of open science.
arXiv Detail & Related papers (2024-06-07T07:38:33Z)
- JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models [110.45794710162241]
Existing work either collects large-scale math-related texts for pre-training, or relies on stronger LLMs to synthesize massive math problems.
We propose an efficient alternative that trains a small LLM to synthesize math problems, generating sufficient high-quality pre-training data.
We use it to synthesize 6 million math problems for pre-training our JiuZhang3.0 model, which requires only 9.3k GPT-4 API calls and pre-training on 4.6B tokens of data.
arXiv Detail & Related papers (2024-05-23T09:43:19Z)
- Requirements Engineering for Research Software: A Vision [2.2217676348694213]
Most researchers creating software for scientific purposes are not trained in Software Engineering.
Research software is often developed ad hoc without following stringent processes.
We describe how researchers elicit, document, and analyze requirements for research software.
arXiv Detail & Related papers (2024-05-13T14:25:01Z)
- GPT-4 as an interface between researchers and computational software: improving usability and reproducibility [44.99833362998488]
We focus on a widely used software for molecular dynamics simulations.
We quantify the usefulness of input files generated by GPT-4 from task descriptions in English.
We find that GPT-4 can generate correct and ready-to-use input files for relatively simple tasks.
In addition, GPT-4's description of computational tasks from input files can be tuned from a detailed set of step-by-step instructions to a summary description appropriate for publications.
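As an illustrative aside, and not the paper's actual setup (the prompt, model settings, and the unnamed simulation package are all assumptions here), generating an input file from an English task description might look like this sketch using the OpenAI Python client:
```python
# Illustrative sketch only: the paper's prompts, model settings, and target
# molecular dynamics package are assumptions, not taken from the source.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

task = (
    "Write a complete input file for a short molecular dynamics "
    "equilibration run of a small protein in water at 300 K."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You generate simulation input files."},
        {"role": "user", "content": task},
    ],
)

# The generated file would still need validation before use.
print(response.choices[0].message.content)
```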
arXiv Detail & Related papers (2023-10-04T14:25:39Z)
- Using Machine Learning To Identify Software Weaknesses From Software Requirement Specifications [49.1574468325115]
This research focuses on finding an efficient machine learning algorithm to identify software weaknesses from requirement specifications.
Keywords extracted using latent semantic analysis help map the CWE categories to PROMISE_exp.
Naive Bayes, support vector machine (SVM), decision trees, neural network, and convolutional neural network (CNN) algorithms were tested.
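As an illustrative aside, and not code from the paper, keyword extraction via latent semantic analysis typically factors a TF-IDF term-document matrix with truncated SVD; a minimal sketch with scikit-learn, using made-up requirement texts in place of PROMISE_exp entries:
```python
# Minimal sketch of LSA-based keyword extraction (illustrative only; the
# paper's actual pipeline, corpus, and parameters are not shown here).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical requirement specifications standing in for PROMISE_exp entries.
docs = [
    "The system shall encrypt all stored user credentials.",
    "Input fields must validate and sanitize user-supplied data.",
    "The application shall log all failed authentication attempts.",
]

# Build a TF-IDF term-document representation.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# LSA: truncated SVD projects terms and documents into a low-rank latent space.
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

# For each latent topic, surface the highest-weighted terms as keywords.
terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top = component.argsort()[::-1][:3]
    print(f"topic {i}:", [terms[t] for t in top])
```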
arXiv Detail & Related papers (2023-08-10T13:19:10Z)
- Exploring and Characterizing Large Language Models For Embedded System Development and Debugging [10.967443876391611]
Large language models (LLMs) have shown remarkable abilities to generate code; however, their ability to develop software for embedded systems has not been studied.
We develop an open-source framework to evaluate leading LLMs and assess their capabilities and limitations for embedded system development.
We leverage these findings to study how human programmers interact with such tools and develop a human-AI-based software engineering workflow for building embedded systems.
arXiv Detail & Related papers (2023-07-07T20:14:22Z)
- Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance of several professional human data analysts with that of GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z)
- Machine Learning for Software Engineering: A Systematic Mapping [73.30245214374027]
The software development industry is rapidly adopting machine learning to transition modern-day software systems toward highly intelligent, self-learning systems.
However, no comprehensive study explores the current state of the art in adopting machine learning across software engineering life cycle stages.
This study introduces a machine learning for software engineering (MLSE) taxonomy classifying the state-of-the-art machine learning techniques according to their applicability to various software engineering life cycle stages.
arXiv Detail & Related papers (2020-05-27T11:56:56Z)