Unveiling Assumptions: Exploring the Decisions of AI Chatbots and Human Testers
- URL: http://arxiv.org/abs/2406.11339v1
- Date: Mon, 17 Jun 2024 08:55:56 GMT
- Title: Unveiling Assumptions: Exploring the Decisions of AI Chatbots and Human Testers
- Authors: Francisco Gomes de Oliveira Neto
- Abstract summary: Decision-making relies on a variety of information, including code, requirements specifications, and other software artifacts.
To fill in the gaps left by unclear information, we often rely on assumptions, intuition, or previous experiences to make decisions.
- Score: 2.5327705116230477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The integration of Large Language Models (LLMs) and chatbots introduces new challenges and opportunities for decision-making in software testing. Decision-making relies on a variety of information, including code, requirements specifications, and other software artifacts that are often unclear or exist solely in the developer's mind. To fill in the gaps left by unclear information, we often rely on assumptions, intuition, or previous experiences to make decisions. This paper explores the potential of LLM-based chatbots like Bard, Copilot, and ChatGPT, to support software testers in test decisions such as prioritizing test cases effectively. We investigate whether LLM-based chatbots and human testers share similar "assumptions" or intuition in prohibitive testing scenarios where exhaustive execution of test cases is often impractical. Preliminary results from a survey of 127 testers indicate a preference for diverse test scenarios, with a significant majority (96%) favoring dissimilar test sets. Interestingly, two out of four chatbots mirrored this preference, aligning with human intuition, while the others opted for similar test scenarios, chosen by only 3.9% of testers. Our initial insights suggest a promising avenue within the context of enhancing the collaborative dynamics between testers and chatbots.
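The preference for dissimilar test sets reported above is the core idea behind diversity-based test prioritization. As a minimal sketch of that idea (our own illustration, not code from the paper; all names are invented), one can greedily pick the test whose minimum Jaccard distance to the already-selected tests is largest:

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 1.0 means completely dissimilar."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def prioritize_by_diversity(test_cases: list[str], budget: int) -> list[str]:
    """Greedily select up to `budget` tests, each maximizing its minimum
    Jaccard distance to the tests already selected."""
    token_sets = {t: set(t.split()) for t in test_cases}
    selected = [test_cases[0]]  # seed with the first test
    while len(selected) < min(budget, len(test_cases)):
        remaining = [t for t in test_cases if t not in selected]
        best = max(remaining, key=lambda t: min(
            jaccard_distance(token_sets[t], token_sets[s]) for s in selected))
        selected.append(best)
    return selected

# Example: prefer tests covering different behaviors over near-duplicates.
tests = ["login with valid password", "login with invalid password",
         "export report as PDF", "login with empty password"]
print(prioritize_by_diversity(tests, budget=2))
# -> ['login with valid password', 'export report as PDF']
```

Jaccard distance over token sets is a crude similarity proxy; any domain-appropriate distance (e.g., over covered branches or API calls) slots in the same way.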
Related papers
- Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures.
We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
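The summary does not specify SMART Testing's prompts; the sketch below only illustrates the general pattern of using an LLM to hypothesize failures, with `call_llm` as a hypothetical stand-in for any chat-completion client and the prompt wording invented for illustration:

```python
def hypothesize_failures(model_card: str, context: str, call_llm) -> list[str]:
    """Ask an LLM for testable failure hypotheses given deployment context.
    `call_llm` is a hypothetical stand-in for any chat-completion client."""
    prompt = (
        "You are auditing an ML model before deployment.\n"
        f"Model description: {model_card}\n"
        f"Deployment context: {context}\n"
        "List concrete, testable hypotheses about where this model is "
        "likely to fail, one per line."
    )
    return [line for line in call_llm(prompt).splitlines() if line.strip()]
```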
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
- Analyzing Large language models chatbots: An experimental approach using a probability test [0.0]
This study consists of qualitative empirical research, conducted through exploratory tests with two different Large Language Models (LLMs).
The methodological procedure involved exploratory tests based on prompts designed with a probability question.
The "Linda Problem", widely recognized in cognitive psychology, was used as a basis to create the tests, along with the development of a new problem specifically for this experiment, the "Mary Problem".
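For context, the Linda Problem probes the conjunction fallacy: probability theory requires P(A ∧ B) ≤ P(A), yet respondents often rate "Linda is a bank teller and active in the feminist movement" as more likely than "Linda is a bank teller." A minimal scorer for such responses might look as follows (our own construction, not the paper's protocol):

```python
def violates_conjunction_rule(rank_of_single: int, rank_of_conjunction: int) -> bool:
    """A response commits the conjunction fallacy if the conjunction
    'A and B' is ranked MORE probable (lower rank number) than A alone,
    since probability theory requires P(A and B) <= P(A)."""
    return rank_of_conjunction < rank_of_single

# Example: a chatbot ranks "bank teller and feminist" 1st, "bank teller" 2nd.
print(violates_conjunction_rule(rank_of_single=2, rank_of_conjunction=1))  # True
```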
arXiv Detail & Related papers (2024-07-10T15:49:40Z)
- Beyond Testers' Biases: Guiding Model Testing with Knowledge Bases using LLMs [30.024465480783835]
We propose Weaver, an interactive tool that supports requirements elicitation for guiding model testing.
Weaver uses large language models to generate knowledge bases and recommends concepts from them interactively, allowing testers to elicit requirements for further testing.
arXiv Detail & Related papers (2023-10-14T21:24:03Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses drawn from a massive set of real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
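As a rough sketch of the debiasing idea described above (assuming a simple multiplicative prior; PriDe's exact estimation procedure is in the paper), one can estimate the model's prior over option IDs by averaging its choice distributions across content permutations of the same question, then divide that prior out of new predictions:

```python
import numpy as np

def estimate_id_prior(probs_over_permutations: np.ndarray) -> np.ndarray:
    """Rows: the model's choice distribution over option IDs for the SAME
    question under different content permutations. Averaging across rows
    cancels content effects, leaving the positional/ID prior."""
    prior = probs_over_permutations.mean(axis=0)
    return prior / prior.sum()

def debias(pred: np.ndarray, prior: np.ndarray) -> np.ndarray:
    """Divide out the ID prior from a raw prediction and renormalize,
    so content, not option position, drives the answer."""
    adjusted = pred / prior
    return adjusted / adjusted.sum()

# Example: the model systematically favors option A.
perms = np.array([[0.40, 0.30, 0.20, 0.10],
                  [0.45, 0.25, 0.20, 0.10],
                  [0.35, 0.35, 0.20, 0.10]])
prior = estimate_id_prior(perms)                           # ≈ [0.40, 0.30, 0.20, 0.10]
print(debias(np.array([0.40, 0.30, 0.20, 0.10]), prior))   # ≈ uniform
```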
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
- Can a Chatbot Support Exploratory Software Testing? Preliminary Results [0.9249657468385781]
Exploratory testing is the de facto approach in agile teams.
This paper presents BotExpTest, designed to support testers while performing exploratory tests of software applications.
We implemented BotExpTest on top of the instant messaging social platform Discord.
Preliminary analyses indicate that BotExpTest may be as effective as similar approaches and help testers to uncover different bugs.
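BotExpTest's source is not reproduced here; the snippet below is only a minimal discord.py sketch of the integration point such a bot builds on, reacting to testers' messages in a channel. The `!note` command and reply text are invented for illustration:

```python
import discord

intents = discord.Intents.default()
intents.message_content = True  # required to read message text (discord.py 2.x)
client = discord.Client(intents=intents)

@client.event
async def on_message(message: discord.Message):
    if message.author == client.user:
        return  # ignore the bot's own messages
    # Hypothetical command: record an exploratory-testing note.
    if message.content.startswith("!note"):
        await message.channel.send(f"Logged: {message.content[6:]}")

client.run("YOUR_BOT_TOKEN")  # placeholder token
```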
arXiv Detail & Related papers (2023-07-11T21:11:21Z)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases.
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform.
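Position bias, one of the limitations named above, can be probed with a simple swap test: ask the judge to compare the same answer pair in both orders and flag disagreement. A generic sketch (not MT-Bench's evaluation code; `judge` is a hypothetical callable):

```python
def has_position_bias(question: str, answer_a: str, answer_b: str, judge) -> bool:
    """`judge(question, first, second)` is a hypothetical callable that
    returns 'first' or 'second'. A position-consistent judge that picks
    answer_a in one order must pick it in the swapped order too."""
    verdict_ab = judge(question, answer_a, answer_b)
    verdict_ba = judge(question, answer_b, answer_a)
    picked_a_then = verdict_ab == "first"
    picked_a_swapped = verdict_ba == "second"
    return picked_a_then != picked_a_swapped  # inconsistent => biased
```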
arXiv Detail & Related papers (2023-06-09T05:55:52Z)
- BiasTestGPT: Using ChatGPT for Social Bias Testing of Language Models [73.29106813131818]
Bias testing is currently cumbersome, since test sentences are either generated from a limited set of manual templates or require expensive crowd-sourcing.
We propose using ChatGPT for the controllable generation of test sentences, given any arbitrary user-specified combination of social groups and attributes.
We present an open-source comprehensive bias testing framework (BiasTestGPT), hosted on HuggingFace, that can be plugged into any open-source PLM for bias testing.
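The controllable-generation step can be pictured as a prompt template over the user-specified groups and attributes; the wording below is illustrative only, not BiasTestGPT's actual prompt:

```python
def bias_test_prompt(groups: list[str], attributes: list[str], n: int = 10) -> str:
    """Builds a generation prompt; the resulting sentences can then be
    scored on any open-source PLM. Wording here is illustrative only."""
    return (
        f"Generate {n} natural sentences, each mentioning one of the social "
        f"groups {groups} together with one of the attributes {attributes}. "
        "Vary sentence structure; keep each sentence on its own line."
    )

print(bias_test_prompt(["nurses", "engineers"], ["caring", "logical"], n=4))
```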
arXiv Detail & Related papers (2023-02-14T22:07:57Z)
- AutoML Two-Sample Test [13.468660785510945]
We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power.
We provide an implementation of the AutoML two-sample test in the Python package autotst.
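The witness-function test is easy to state concretely. The sketch below substitutes scikit-learn logistic regression for autotst's AutoML model search (and logistic loss for the squared loss the paper analyzes): fit a classifier to separate the two samples on a training split, use its score as the witness, and take the held-out mean discrepancy as the statistic with a permutation p-value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def witness_two_sample_test(X, Y, n_perm=1000, seed=0):
    """Witness = classifier trained to separate X from Y on first halves;
    statistic = mean witness score on held-out X minus held-out Y;
    p-value via label permutation on the held-out scores."""
    rng = np.random.default_rng(seed)
    half_x, half_y = len(X) // 2, len(Y) // 2
    witness = LogisticRegression().fit(
        np.vstack([X[:half_x], Y[:half_y]]),
        np.r_[np.ones(half_x), np.zeros(half_y)])
    scores = witness.predict_proba(np.vstack([X[half_x:], Y[half_y:]]))[:, 1]
    labels = np.r_[np.ones(len(X) - half_x), np.zeros(len(Y) - half_y)]
    stat = scores[labels == 1].mean() - scores[labels == 0].mean()
    perm_stats = [scores[p == 1].mean() - scores[p == 0].mean()
                  for p in (rng.permutation(labels) for _ in range(n_perm))]
    return stat, float(np.mean(np.array(perm_stats) >= stat))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 2))  # sample from P
Y = rng.normal(0.5, 1.0, size=(200, 2))  # sample from Q (shifted mean)
print(witness_two_sample_test(X, Y))      # small p-value => distributions differ
```

Splitting the data before fitting the witness is what keeps the test valid: the statistic is computed only on points the witness never saw.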
arXiv Detail & Related papers (2022-06-17T15:41:07Z)
- TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning Tasks [14.547623982073475]
Deep learning systems are notoriously difficult to test and debug.
To reduce labeling cost, it is essential to select and label only "high-quality", bug-revealing test inputs.
We propose a novel test prioritization technique that brings order into the unlabeled test instances according to their bug-revealing capabilities, namely TestRank.
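As a much-simplified stand-in for ranking unlabeled inputs by estimated bug-revealing capability (TestRank's actual technique is more involved than this), one can rank by the model's own uncertainty, e.g., smallest top-2 softmax margin first:

```python
import numpy as np

def rank_by_margin(softmax_outputs: np.ndarray) -> np.ndarray:
    """Rank unlabeled test inputs so the least confident predictions
    (smallest gap between top-2 class probabilities) come first;
    low-margin inputs are more likely misclassified, so labeling them
    first tends to reveal more bugs per labeling dollar."""
    part = np.sort(softmax_outputs, axis=1)
    margins = part[:, -1] - part[:, -2]  # top-1 minus top-2 probability
    return np.argsort(margins)           # ascending: most uncertain first

probs = np.array([[0.98, 0.01, 0.01],   # confident
                  [0.40, 0.35, 0.25],   # uncertain -> ranked first
                  [0.70, 0.20, 0.10]])
print(rank_by_margin(probs))  # [1 2 0]
```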
arXiv Detail & Related papers (2021-05-21T03:41:10Z)
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
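To give a flavor of the methodology: a minimum functionality test (MFT), one of CheckList's test types, fills a template with perturbable slots and asserts the expected behavior on every instantiation. The snippet below is a generic sketch in plain Python, not the checklist package's API:

```python
from itertools import product

def mft_negation(predict) -> list[str]:
    """Minimum Functionality Test: negated positive words should yield
    negative sentiment. `predict` is a hypothetical sentiment function
    returning 'positive' or 'negative'."""
    verbs, adjectives = ["is not", "was never"], ["good", "great", "amazing"]
    failures = []
    for verb, adj in product(verbs, adjectives):
        sentence = f"The food {verb} {adj}."
        if predict(sentence) != "negative":
            failures.append(sentence)
    return failures

# Example with a naive keyword model that ignores negation -> all 6 fail.
naive = lambda s: "positive" if any(w in s for w in ("good", "great", "amazing")) else "negative"
print(mft_negation(naive))
```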
arXiv Detail & Related papers (2020-05-08T15:48:31Z)