ChatGPT and Human Synergy in Black-Box Testing: A Comparative Analysis
- URL: http://arxiv.org/abs/2401.13924v1
- Date: Thu, 25 Jan 2024 03:42:17 GMT
- Title: ChatGPT and Human Synergy in Black-Box Testing: A Comparative Analysis
- Authors: Hiroyuki Kirinuki, Haruto Tanno
- Abstract summary: ChatGPT can generate test cases that generally match or slightly surpass those created by human participants.
When ChatGPT cooperates with humans, it can cover considerably more test viewpoints than each can achieve alone.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, large language models (LLMs), such as ChatGPT, have been
pivotal in advancing various artificial intelligence applications, including
natural language processing and software engineering. A promising yet
underexplored area is utilizing LLMs in software testing, particularly in
black-box testing. This paper explores the test cases devised by ChatGPT in
comparison to those created by human participants. In this study, ChatGPT
(GPT-4) and four participants each created black-box test cases for three
applications based on specifications written by the authors. The goal was to
evaluate the real-world applicability of the proposed test cases, identify
potential shortcomings, and comprehend how ChatGPT could enhance human testing
strategies. ChatGPT can generate test cases that generally match or slightly
surpass those created by human participants in terms of test viewpoint
coverage. Additionally, our experiments demonstrated that when ChatGPT
cooperates with humans, it can cover considerably more test viewpoints than
each can achieve alone, suggesting that collaboration between humans and
ChatGPT may be more effective than human pairs working together. Nevertheless,
we noticed that the test cases generated by ChatGPT have certain issues that
require addressing before use.
Related papers
- Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z) - Primacy Effect of ChatGPT [69.49920102917598]
We study the primacy effect of ChatGPT: the tendency of selecting the labels at earlier positions as the answer.
We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions.
arXiv Detail & Related papers (2023-10-20T00:37:28Z) - Inappropriate Benefits and Identification of ChatGPT Misuse in
Programming Tests: A Controlled Experiment [0.0]
Students can ask ChatGPT to complete a programming task, generating a solution from other people's work without proper acknowledgment of the source(s)
We performed a controlled experiment measuring the inappropriate benefits of using ChatGPT in terms of completion time and programming performance.
arXiv Detail & Related papers (2023-08-11T06:42:29Z) - ChatGPT: A Study on its Utility for Ubiquitous Software Engineering
Tasks [2.084078990567849]
ChatGPT (Chat Generative Pre-trained Transformer) launched by OpenAI on November 30, 2022.
In this study, we explore how ChatGPT can be used to help with common software engineering tasks.
arXiv Detail & Related papers (2023-05-26T11:29:06Z) - No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation [11.009117714870527]
Unit testing is essential in detecting bugs in functionally-discrete program units.
Recent work has shown the large potential of large language models (LLMs) in unit test generation.
It remains unclear how effective ChatGPT is in unit test generation.
arXiv Detail & Related papers (2023-05-07T07:17:08Z) - ChatGPT-Crawler: Find out if ChatGPT really knows what it's talking
about [15.19126287569545]
This research examines the responses generated by ChatGPT from different Conversational QA corpora.
The study employed BERT similarity scores to compare these responses with correct answers and obtain Natural Language Inference(NLI) labels.
The study identified instances where ChatGPT provided incorrect answers to questions, providing insights into areas where the model may be prone to error.
arXiv Detail & Related papers (2023-04-06T18:42:47Z) - To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z) - Can ChatGPT Understand Too? A Comparative Study on ChatGPT and
Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability by evaluating it on the most popular GLUE benchmark, and comparing it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves comparable performance compared with BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z) - BiasTestGPT: Using ChatGPT for Social Bias Testing of Language Models [73.29106813131818]
bias testing is currently cumbersome since the test sentences are generated from a limited set of manual templates or need expensive crowd-sourcing.
We propose using ChatGPT for the controllable generation of test sentences, given any arbitrary user-specified combination of social groups and attributes.
We present an open-source comprehensive bias testing framework (BiasTestGPT), hosted on HuggingFace, that can be plugged into any open-source PLM for bias testing.
arXiv Detail & Related papers (2023-02-14T22:07:57Z) - Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z) - How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation,
and Detection [8.107721810172112]
ChatGPT is able to respond effectively to a wide range of human questions.
People are starting to worry about the potential negative impacts that large language models (LLMs) like ChatGPT could have on society.
In this work, we collected tens of thousands of comparison responses from both human experts and ChatGPT.
arXiv Detail & Related papers (2023-01-18T15:23:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.