An empirical study of ChatGPT-3.5 on question answering and code
maintenance
- URL: http://arxiv.org/abs/2310.02104v1
- Date: Tue, 3 Oct 2023 14:48:32 GMT
- Title: An empirical study of ChatGPT-3.5 on question answering and code
maintenance
- Authors: Md Mahir Asef Kabir, Sk Adnan Hassan, Xiaoyin Wang, Ying Wang, Hai Yu,
Na Meng
- Abstract summary: A rising concern is whether ChatGPT will replace programmers and eliminate their jobs.
We conducted an empirical study to systematically compare ChatGPT against programmers in question answering and software maintenance.
- Score: 14.028497274245227
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ever since the launch of ChatGPT in 2022, there has been rising
concern over whether ChatGPT will replace programmers and eliminate their
jobs. Motivated by this widespread concern, we conducted an empirical study
to systematically compare ChatGPT against programmers in question answering
and software maintenance. We reused a dataset
introduced by prior work, which includes 130 StackOverflow (SO) discussion
threads referred to by the Java developers of 357 GitHub projects. We mainly
investigated three research questions (RQs). First, how does ChatGPT compare
with programmers when answering technical questions? Second, how do developers
perceive the differences between ChatGPT's answers and SO answers? Third, how
does ChatGPT compare with humans when revising code for maintenance requests?
For RQ1, we provided the 130 SO questions to ChatGPT, and manually compared
ChatGPT answers with the accepted/most popular SO answers in terms of
relevance, readability, informativeness, comprehensiveness, and reusability.
For RQ2, we conducted a user study with 30 developers, asking each developer to
assess and compare 10 pairs of answers, without knowing the information source
(i.e., ChatGPT or SO). For RQ3, we distilled 48 software maintenance tasks from
48 GitHub projects citing the studied SO threads. We queried ChatGPT to revise
a given Java file and to incorporate the code implementing any prescribed
maintenance requirement. Our study reveals interesting phenomena: for the
majority of SO questions (97/130), ChatGPT provided better answers; in 203 of
300 ratings, developers preferred ChatGPT answers to SO answers; and ChatGPT
revised code correctly for 22 of the 48 tasks. Our research will expand
people's knowledge of ChatGPT's capabilities and shed light on the future
adoption of ChatGPT by the software industry.
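To make the querying protocol concrete, below is a minimal sketch of how an RQ3-style query could be issued programmatically. It assumes the official OpenAI Python client and the gpt-3.5-turbo model name as a stand-in for "ChatGPT-3.5"; the paper's exact prompts and interface are not given here, so the wording and file paths are illustrative, not the authors' protocol.

```python
# Sketch: asking a ChatGPT-3.5 model to revise one Java file for one
# maintenance request, in the spirit of RQ3. The client library, model
# name, and prompt wording are assumptions, not the paper's setup.
from pathlib import Path

from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def revise_java_file(java_path: str, maintenance_request: str) -> str:
    """Return the model's revision of the file for the given request."""
    source = Path(java_path).read_text(encoding="utf-8")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed stand-in for "ChatGPT-3.5"
        messages=[
            {"role": "system",
             "content": "You are a Java maintenance assistant. "
                        "Return the complete revised file."},
            {"role": "user",
             "content": f"Maintenance request: {maintenance_request}\n\n"
                        f"Current Java file:\n{source}"},
        ],
    )
    return response.choices[0].message.content


# Hypothetical usage: one of the 48 distilled tasks would supply both
# arguments, and the returned file would then be compiled and tested.
# revised = revise_java_file("src/main/java/Cache.java",
#                            "Evict entries older than a configurable TTL")
```

Whether the returned revision is "correct" would still require compiling and testing it against the host project, which is what distinguishes the paper's 22/48 result from a purely textual comparison.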
Related papers
- An exploratory analysis of Community-based Question-Answering Platforms and GPT-3-driven Generative AI: Is it the end of online community-based learning? [0.6749750044497732]
ChatGPT offers software engineers an interactive alternative to community question-answering platforms like Stack Overflow.
We analyze 2564 Python and JavaScript questions from StackOverflow that were asked between January 2022 and December 2022.
Our analysis indicates that ChatGPT's responses are 66% shorter and share 35% more words with the questions, showing a 25% increase in positive sentiment compared to human responses.
arXiv Detail & Related papers (2024-09-26T02:17:30Z)
- ChatGPT Incorrectness Detection in Software Reviews [0.38233569758620056]
We developed a tool called CID (ChatGPT Incorrectness Detector) to automatically test for and detect incorrectness in ChatGPT responses.
In a benchmark study of library selection, we show that CID can detect incorrect responses from ChatGPT with an F1-score of 0.74-0.75.
arXiv Detail & Related papers (2024-03-25T00:50:27Z)
- Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z)
- Primacy Effect of ChatGPT [69.49920102917598]
We study the primacy effect of ChatGPT: its tendency to select labels at earlier positions as the answer (see the measurement sketch after this list).
We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions.
arXiv Detail & Related papers (2023-10-20T00:37:28Z)
- Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of ChatGPT Answers to Stack Overflow Questions [7.065853028825656]
We conducted the first in-depth analysis of ChatGPT answers to programming questions on Stack Overflow.
We examined the correctness, consistency, comprehensiveness, and conciseness of ChatGPT answers.
Our analysis shows that 52% of ChatGPT answers contain incorrect information and 77% are verbose.
arXiv Detail & Related papers (2023-08-04T13:23:20Z)
- Evaluating Privacy Questions From Stack Overflow: Can ChatGPT Compete? [1.231476564107544]
ChatGPT has been used as an alternative to generate code or produce responses to developers' questions.
Our results show that most privacy-related questions are related to choice/consent, aggregation, and identification.
arXiv Detail & Related papers (2023-06-19T21:33:04Z)
- ChatGPT: A Study on its Utility for Ubiquitous Software Engineering Tasks [2.084078990567849]
ChatGPT (Chat Generative Pre-trained Transformer) was launched by OpenAI on November 30, 2022.
In this study, we explore how ChatGPT can be used to help with common software engineering tasks.
arXiv Detail & Related papers (2023-05-26T11:29:06Z)
- ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time [54.18651663847874]
ChatGPT has achieved great success and can be considered to have acquired infrastructural status.
Existing benchmarks encounter two challenges: (1) disregard for periodic evaluation and (2) lack of fine-grained features.
We construct ChatLog, an ever-updating dataset with large-scale records of diverse long-form ChatGPT responses for 21 NLP benchmarks from March 2023 onward.
arXiv Detail & Related papers (2023-04-27T11:33:48Z)
- ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models [49.52083248451775]
Large language models (LLMs) have made significant progress in NLP.
We specifically focus on ChatGPT, a widely used and easily accessible LLM.
We conduct a series of experiments on 11 datasets to evaluate ChatGPT's commonsense abilities.
arXiv Detail & Related papers (2023-03-29T03:05:43Z)
- Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability on the popular GLUE benchmark and compare it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves performance comparable to BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z)
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the NLP community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
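As referenced in the "Primacy Effect of ChatGPT" entry above, one simple way to probe label-order bias is to present the same question repeatedly with the candidate labels in random order and measure how often the model picks whichever label is listed first. Below is a minimal sketch of that protocol; the model call is stubbed with an unbiased random baseline, and the question and label set are hypothetical.

```python
# Sketch: estimating a primacy effect by shuffling the presentation
# order of candidate labels and counting first-position picks.
import random


def ask_model(question: str, options: list[str]) -> str:
    # Placeholder: replace with a real ChatGPT call that presents the
    # options in exactly this order. Uniform choice = unbiased baseline.
    return random.choice(options)


def first_position_rate(question: str, labels: list[str],
                        trials: int = 1000) -> float:
    """Fraction of trials in which the model picks the first-listed label."""
    hits = 0
    for _ in range(trials):
        order = random.sample(labels, k=len(labels))  # random label order
        if ask_model(question, order) == order[0]:
            hits += 1
    return hits / trials


labels = ["entailment", "neutral", "contradiction"]  # hypothetical label set
rate = first_position_rate("Does the premise entail the hypothesis?", labels)
print(f"first-position pick rate: {rate:.2f} "
      f"(unbiased baseline ~ {1 / len(labels):.2f})")
```

A primacy-biased model would score well above the 1/len(labels) baseline; wiring ask_model to a real chat API turns this sketch into an actual measurement.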