Beyond Testers' Biases: Guiding Model Testing with Knowledge Bases using
LLMs
- URL: http://arxiv.org/abs/2310.09668v1
- Date: Sat, 14 Oct 2023 21:24:03 GMT
- Title: Beyond Testers' Biases: Guiding Model Testing with Knowledge Bases using
LLMs
- Authors: Chenyang Yang, Rishabh Rustogi, Rachel Brower-Sinning, Grace A. Lewis,
Christian Kästner, Tongshuang Wu
- Abstract summary: We propose Weaver, an interactive tool that supports requirements elicitation for guiding model testing.
Weaver uses large language models to generate knowledge bases and recommends concepts from them interactively, allowing testers to elicit requirements for further testing.
- Score: 30.024465480783835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current model testing work has mostly focused on creating test cases.
Identifying what to test is a step that is largely ignored and poorly
supported. We propose Weaver, an interactive tool that supports requirements
elicitation for guiding model testing. Weaver uses large language models to
generate knowledge bases and recommends concepts from them interactively,
allowing testers to elicit requirements for further testing. Weaver provides
rich external knowledge to testers and encourages testers to systematically
explore diverse concepts beyond their own biases. In a user study, we show that
both NLP experts and non-experts identified more, as well as more diverse
concepts worth testing when using Weaver. Collectively, they found more than
200 failing test cases for stance detection with zero-shot ChatGPT. Our case
studies further show that Weaver can help practitioners test models in
real-world settings, where developers define more nuanced application scenarios
(e.g., code understanding and transcript summarization) using LLMs.
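As a rough illustration of the workflow described in the abstract (an LLM-generated knowledge base of concepts from which a tester elicits requirements for testing), the sketch below is not Weaver's implementation; the prompts, the `gpt-4o-mini` model name, and the two-step structure are assumptions.
```python
# Minimal sketch (not Weaver itself): use an LLM to build a small knowledge
# base of concepts for a task, then turn a chosen concept into test inputs.
# Assumes the OpenAI Python client (v1+) and an OPENAI_API_KEY in the env.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumed model name; any chat model would do

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# Step 1: elicit a concept "knowledge base" for the task under test.
task = "stance detection on tweets about climate change"
concepts = ask(
    f"List 15 distinct real-world concepts, subtopics, or phrasing styles "
    f"that inputs for the task '{task}' could involve. One per line."
).splitlines()

# Step 2: a tester picks a concept and asks for concrete test inputs for it.
chosen = concepts[0]
test_inputs = ask(
    f"Write 5 short tweets about climate change that involve the concept "
    f"'{chosen}' and whose stance (favor/against/neutral) is unambiguous. "
    f"Format each line as: <tweet> ||| <gold stance>"
).splitlines()

for line in test_inputs:
    print(line)  # the tester then runs these through the model under test
```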
Related papers
- Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures.
We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
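A minimal sketch of the idea as summarized above, feeding modeling context to an LLM so it can hypothesize candidate failure modes; the context, prompt, and model name are illustrative assumptions, not the SMART Testing system.
```python
# Rough sketch of context-aware testing: give task context to an LLM and ask
# it to hypothesize likely failure modes a tester can turn into checks.
from openai import OpenAI

client = OpenAI()

context = (
    "Task: predict 30-day hospital readmission from tabular records. "
    "Columns: age, sex, diagnosis_code, num_prior_visits, insurance_type. "
    "Known issue: insurance_type is missing for 20% of rows."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{
        "role": "user",
        "content": (
            "Given this modeling context:\n" + context +
            "\nHypothesize 5 specific, likely failure modes of a trained "
            "model, each tied to a subgroup or data condition I could test."
        ),
    }],
)
print(resp.choices[0].message.content)  # candidate failures to test next
```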
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
- Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests [4.574205608859157]
We introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases.
We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases.
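A rough sketch of the general recipe summarized above, asking an LLM to make a machine-generated unit test more readable without changing its assertions; the Java test, prompt, and model name are invented for illustration and do not reflect UTGen's actual pipeline.
```python
# Sketch only: post-process a machine-generated JUnit test with an LLM to
# improve naming and comments while keeping the tested behavior unchanged.
from openai import OpenAI

client = OpenAI()

generated_test = """
@Test
public void test0() throws Throwable {
    Stack<Integer> stack0 = new Stack<Integer>();
    stack0.push(42);
    Integer integer0 = stack0.pop();
    assertEquals(42, (int) integer0);
}
"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{
        "role": "user",
        "content": (
            "Rewrite this generated JUnit test so it is easier to read: use "
            "descriptive names, add a short comment per step, and keep every "
            "assertion and the tested behavior unchanged.\n" + generated_test
        ),
    }],
)
print(resp.choices[0].message.content)
```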
arXiv Detail & Related papers (2024-08-21T15:35:34Z)
- Unveiling Assumptions: Exploring the Decisions of AI Chatbots and Human Testers [2.5327705116230477]
Decision-making relies on a variety of information, including code, requirements specifications, and other software artifacts.
To fill in the gaps left by unclear information, we often rely on assumptions, intuition, or previous experiences to make decisions.
arXiv Detail & Related papers (2024-06-17T08:55:56Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
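A hedged sketch of LLM-generated functional tests in the spirit of this summary, not GPT-HateCheck's released code; the functionality, prompt, label format, and placeholder classifier are assumptions.
```python
# Generate test cases for one fine-grained functionality with an LLM, then
# count failures of a model under test (placeholder here).
from openai import OpenAI

client = OpenAI()

functionality = "non-hateful uses of slurs reclaimed by the targeted group"
raw = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{
        "role": "user",
        "content": (
            f"Write 10 short social-media posts exercising the functionality "
            f"'{functionality}'. End each line with ' => hateful' or "
            f"' => non-hateful' as the gold label."
        ),
    }],
).choices[0].message.content

def model_under_test(text: str) -> str:
    return "non-hateful"  # placeholder; plug in the real hate-speech classifier

failures = 0
for line in raw.splitlines():
    if " => " not in line:
        continue
    text, gold = line.rsplit(" => ", 1)
    if model_under_test(text.strip()) != gold.strip():
        failures += 1
print(f"{failures} failing cases for functionality: {functionality}")
```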
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Deep anytime-valid hypothesis testing [29.273915933729057]
We propose a general framework for constructing powerful, sequential hypothesis tests for nonparametric testing problems.
We develop a principled approach of leveraging the representation capability of machine learning models within the testing-by-betting framework.
Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines.
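The toy sketch below shows the core testing-by-betting mechanic mentioned above (a nonnegative wealth process that can only grow substantially if the null is false), using a hand-picked witness function rather than the learned machine-learning model the paper proposes.
```python
# Toy sequential two-sample test by betting: under H0 (P = Q) the wealth is a
# nonnegative martingale, so rejecting when it exceeds 1/alpha is anytime-valid
# by Ville's inequality. The witness function here is fixed by hand.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05           # type-I error level
wealth = 1.0
bet = 0.5              # fixed, predictable bet in (-1, 1)

def witness(z):        # hand-picked witness mapping observations into [0, 1]
    return 1.0 / (1.0 + np.exp(-z))

for t in range(1, 5001):
    x = rng.normal(0.0, 1.0)   # sample from P
    y = rng.normal(0.5, 1.0)   # sample from Q (shifted, so H0 is false)
    payoff = witness(y) - witness(x)   # mean zero under H0, bounded in [-1, 1]
    wealth *= 1.0 + bet * payoff       # stays nonnegative since |payoff| <= 1
    if wealth >= 1.0 / alpha:          # reject H0 at any stopping time
        print(f"Reject H0 at step {t}, wealth = {wealth:.1f}")
        break
else:
    print(f"No rejection after {t} steps, wealth = {wealth:.2f}")
```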
arXiv Detail & Related papers (2023-10-30T09:46:19Z)
- Towards Autonomous Testing Agents via Conversational Large Language Models [18.302956037305112]
Large language models (LLMs) can be used as automated testing assistants.
We present a taxonomy of LLM-based testing agents based on their level of autonomy.
arXiv Detail & Related papers (2023-06-08T12:22:38Z)
- Zero-shot Model Diagnosis [80.36063332820568]
A common approach to evaluating deep learning models is to build a labeled test set with attributes of interest and assess how well the model performs on it.
This paper argues the case that Zero-shot Model Diagnosis (ZOOM) is possible without the need for a test set nor labeling.
arXiv Detail & Related papers (2023-03-27T17:59:33Z)
- BiasTestGPT: Using ChatGPT for Social Bias Testing of Language Models [73.29106813131818]
Bias testing is currently cumbersome because test sentences are generated from a limited set of manual templates or require expensive crowd-sourcing.
We propose using ChatGPT for the controllable generation of test sentences, given any arbitrary user-specified combination of social groups and attributes.
We present an open-source comprehensive bias testing framework (BiasTestGPT), hosted on HuggingFace, that can be plugged into any open-source PLM for bias testing.
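A minimal sketch of controllable bias-test generation as summarized above, not the released BiasTestGPT framework; the groups, attribute, prompt, and model name are illustrative assumptions.
```python
# Ask a chat LLM for template sentences linking an attribute to a {GROUP}
# placeholder, then instantiate the placeholder with each user-specified
# social group to obtain paired test sentences for a PLM under test.
from openai import OpenAI

client = OpenAI()

groups = ["women", "men"]        # user-specified social groups
attribute = "career ambition"    # user-specified attribute

raw = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{
        "role": "user",
        "content": (
            f"Write 5 natural sentences that associate the attribute "
            f"'{attribute}' with the placeholder {{GROUP}}. Use {{GROUP}} "
            f"exactly once per sentence, one sentence per line."
        ),
    }],
).choices[0].message.content

for template in raw.splitlines():
    if "{GROUP}" not in template:
        continue
    pair = [template.replace("{GROUP}", g) for g in groups]
    print(pair)  # feed each pair to the PLM under test and compare its scores
```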
arXiv Detail & Related papers (2023-02-14T22:07:57Z)
- TestAug: A Framework for Augmenting Capability-based NLP Tests [6.418039698186639]
Capability-based NLP testing allows model developers to test the functional capabilities of NLP models.
Existing work on capability-based testing requires extensive manual efforts and domain expertise in creating the test cases.
In this paper, we investigate a low-cost approach to test case generation that leverages the GPT-3 engine.
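A small sketch of the general augmentation idea, not TestAug's pipeline: one seed capability test case is expanded into label-preserving variants by an LLM; the seed example, prompt, and sanity filter are assumptions.
```python
# Expand one seed case for the "negation" capability into many variants that
# keep the same gold label, then apply a light sanity filter.
from openai import OpenAI

client = OpenAI()

seed_text, seed_label = "I don't think this movie is good.", "negative"

raw = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; the summary above mentions GPT-3
    messages=[{
        "role": "user",
        "content": (
            f"Generate 10 new sentences that, like '{seed_text}', express a "
            f"{seed_label} sentiment through negation. One per line."
        ),
    }],
).choices[0].message.content

# Light deduplication standing in for proper validation of generated cases.
augmented = []
for line in raw.splitlines():
    line = line.lstrip("0123456789.- ").strip()
    if line and line.lower() not in (s.lower() for s in augmented):
        augmented.append(line)

test_suite = [(s, seed_label) for s in augmented]
print(f"{len(test_suite)} augmented negation test cases")
```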
arXiv Detail & Related papers (2022-10-14T20:42:16Z)
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
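The released CheckList toolkit has its own template API; the plain-Python sketch below only illustrates two of its test types (a Minimum Functionality Test and an invariance test) against a placeholder model.
```python
# Plain-Python illustration of two CheckList-style tests; not the toolkit.
import itertools

def model_under_test(text: str) -> str:
    return "negative" if "not" in text else "positive"  # placeholder model

# Minimum Functionality Test (MFT): templated negation cases with gold labels.
names = ["the food", "the service", "this airline"]
adjs = ["good", "great", "amazing"]
mft_failures = [
    t for t in (f"{n} is not {a}." for n, a in itertools.product(names, adjs))
    if model_under_test(t) != "negative"
]

# Invariance test (INV): a label-preserving perturbation (a typo) must not
# flip the prediction.
originals = ["The food was great.", "The crew was not helpful."]
inv_failures = [
    s for s in originals
    if model_under_test(s) != model_under_test(s.replace("was", "wsa"))
]

print(f"MFT failures: {len(mft_failures)}, INV failures: {len(inv_failures)}")
```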
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
- Evaluating Models' Local Decision Boundaries via Contrast Sets [119.38387782979474]
We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data.
We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets.
Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets.
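A tiny worked example of a contrast set and the contrast-consistency idea; the sentences and the placeholder model are invented for illustration and are not taken from the paper's 10 datasets.
```python
# A contrast set bundles an original test instance with small, meaningful
# edits that change the gold label; consistency requires getting all right.
def model_under_test(text: str) -> str:
    # placeholder bag-of-words model standing in for the system under test
    return "positive" if "moving" in text.lower() else "negative"

original = ("A thoughtful, moving film with a stellar cast.", "positive")
contrast_set = [
    ("A muddled, tedious film despite its stellar cast.", "negative"),
    ("Neither moving nor memorable, despite its stellar cast.", "negative"),
]

bundle = [original] + contrast_set
correct = [model_under_test(t) == gold for t, gold in bundle]
print("accuracy on bundle:", sum(correct) / len(bundle))
print("contrast consistency (all correct):", all(correct))
```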
arXiv Detail & Related papers (2020-04-06T14:47:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.