Related papers: ChatGPT Incorrectness Detection in Software Reviews

ChatGPT Incorrectness Detection in Software Reviews

URL: http://arxiv.org/abs/2403.16347v1
Date: Mon, 25 Mar 2024 00:50:27 GMT
Title: ChatGPT Incorrectness Detection in Software Reviews
Authors: Minaoar Hossain Tanzil, Junaed Younus Khan, Gias Uddin,
Abstract summary: We developed a tool called CID (ChatGPT Incorrectness Detector) to automatically test and detect the incorrectness in ChatGPT responses. In a benchmark study of library selection, we show that CID can detect incorrect responses from ChatGPT with an F1-score of 0.74 - 0.75.
Score: 0.38233569758620056
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We conducted a survey of 135 software engineering (SE) practitioners to understand how they use Generative AI-based chatbots like ChatGPT for SE tasks. We find that they want to use ChatGPT for SE tasks like software library selection but often worry about the truthfulness of ChatGPT responses. We developed a suite of techniques and a tool called CID (ChatGPT Incorrectness Detector) to automatically test and detect the incorrectness in ChatGPT responses. CID is based on the iterative prompting to ChatGPT by asking it contextually similar but textually divergent questions (using an approach that utilizes metamorphic relationships in texts). The underlying principle in CID is that for a given question, a response that is different from other responses (across multiple incarnations of the question) is likely an incorrect response. In a benchmark study of library selection, we show that CID can detect incorrect responses from ChatGPT with an F1-score of 0.74 - 0.75.

Related papers

Why Do Developers Engage with ChatGPT in Issue-Tracker? Investigating Usage and Reliance on ChatGPT-Generated Code [4.605779671279481]
We analyzed 1,152 Developer-ChatGPT conversations across 1,012 issues in GitHub. ChatGPT is primarily utilized for ideation, whereas its usage for validation is minimal. ChatGPT-generated code was used as-is to resolve only 5.83% of the issues.
arXiv Detail & Related papers (2024-12-09T18:47:31Z)
A Study on the Vulnerability of Test Questions against ChatGPT-based Cheating [14.113742357609285]
ChatGPT can answer text prompts fairly accurately, even performing very well on postgraduate-level questions. Many educators have found that their take-home or remote tests and exams are vulnerable to ChatGPT-based cheating.
arXiv Detail & Related papers (2024-02-21T23:51:06Z)
Evaluating ChatGPT as a Question Answering System: A Comprehensive Analysis and Comparison with Existing Models [0.0]
This article scrutinizes ChatGPT as a Question Answering System (QAS) The primary focus is on evaluating ChatGPT's proficiency in extracting responses from provided paragraphs. The evaluation highlights hallucinations, where ChatGPT provides responses to questions without available answers in the provided context.
arXiv Detail & Related papers (2023-12-11T08:49:18Z)
Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples. One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports. Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z)
Primacy Effect of ChatGPT [69.49920102917598]
We study the primacy effect of ChatGPT: the tendency of selecting the labels at earlier positions as the answer. We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions.
arXiv Detail & Related papers (2023-10-20T00:37:28Z)
An empirical study of ChatGPT-3.5 on question answering and code maintenance [14.028497274245227]
A rising concern is whether ChatGPT will replace programmers and kill jobs. We conducted an empirical study to systematically compare ChatGPT against programmers in question-answering and software-maintaining.
arXiv Detail & Related papers (2023-10-03T14:48:32Z)
Fighting Fire with Fire: Can ChatGPT Detect AI-generated Text? [20.37071875344405]
We evaluate the zero-shot performance of ChatGPT in the task of human-written vs. AI-generated text detection. We empirically investigate if ChatGPT is symmetrically effective in detecting AI-generated or human-written text.
arXiv Detail & Related papers (2023-08-02T17:11:37Z)
ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time [54.18651663847874]
ChatGPT has achieved great success and can be considered to have acquired an infrastructural status. Existing benchmarks encounter two challenges: (1) Disregard for periodical evaluation and (2) Lack of fine-grained features. We construct ChatLog, an ever-updating dataset with large-scale records of diverse long-form ChatGPT responses for 21 NLP benchmarks from March, 2023 to now.
arXiv Detail & Related papers (2023-04-27T11:33:48Z)
To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection. We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains. Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z)
Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot. Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community. It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.