TestAug: A Framework for Augmenting Capability-based NLP Tests
- URL: http://arxiv.org/abs/2210.08097v1
- Date: Fri, 14 Oct 2022 20:42:16 GMT
- Title: TestAug: A Framework for Augmenting Capability-based NLP Tests
- Authors: Guanqun Yang, Mirazul Haque, Qiaochu Song, Wei Yang, Xueqing Liu
- Abstract summary: Capability-based NLP testing allows model developers to test the functional capabilities of NLP models.
Existing work on capability-based testing requires extensive manual effort and domain expertise to create the test cases.
In this paper, we investigate a low-cost approach to test case generation by leveraging the GPT-3 engine.
- Score: 6.418039698186639
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The recently proposed capability-based NLP testing allows model developers to
test the functional capabilities of NLP models, revealing functional failures
that cannot be detected by the traditional held-out evaluation mechanism.
However, existing work on capability-based testing requires extensive manual
effort and domain expertise to create the test cases. In this paper, we
investigate a low-cost approach to test case generation by leveraging the
GPT-3 engine. We
further propose to use a classifier to remove the invalid outputs from GPT-3
and expand the outputs into templates to generate more test cases. Our
experiments show that TestAug has three advantages over the existing work on
behavioral testing: (1) TestAug can find more bugs than existing work; (2) The
test cases in TestAug are more diverse; and (3) TestAug greatly reduces the
manual effort of creating the test suites. The code and data for TestAug can
be found at our project website (https://guanqun-yang.github.io/testaug/) and
GitHub (https://github.com/guanqun-yang/testaug).
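The abstract describes a three-step pipeline: prompt GPT-3 for candidate test cases, filter out invalid candidates with a classifier, and expand the survivors into templates. The sketch below illustrates only the first two steps; the prompt, the stubbed generator, and the toy validity classifier are illustrative assumptions, not the authors' implementation (see the project repository for the real code).

```python
# Minimal sketch of the generate-then-validate idea from the abstract.
# Everything here is illustrative: the stubbed LLM call and the toy validity
# classifier stand in for GPT-3 and the trained classifier used by TestAug.

from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (sentence, expected_label)


def generate_candidates(prompt: str, n: int) -> List[Candidate]:
    """Stand-in for a GPT-3-style completion call.

    Hard-coded outputs keep the sketch runnable offline; in practice this
    would send `prompt` to the LLM and parse its completions.
    """
    pool = [
        ("I did not enjoy the movie at all.", "negative"),
        ("The service was not bad at all.", "positive"),
        ("The plot was great.", "negative"),  # invalid: label contradicts text
    ]
    return pool[:n]


def filter_valid(candidates: List[Candidate],
                 classifier: Callable[[str], str]) -> List[Candidate]:
    """Keep only candidates whose intended label agrees with the validity
    classifier, mirroring the classifier-based filtering in the abstract."""
    return [(text, label) for text, label in candidates
            if classifier(text) == label]


if __name__ == "__main__":
    # Toy rule-based classifier; TestAug would use a trained model instead.
    def toy_sentiment(text: str) -> str:
        positive_overrides = ("not bad", "great", "love")
        negative_cues = ("not", "never", "no ")
        if any(cue in text.lower() for cue in positive_overrides):
            return "positive"
        return "negative" if any(cue in text.lower() for cue in negative_cues) else "positive"

    prompt = "Write sentences that express sentiment through negation."
    candidates = generate_candidates(prompt, n=3)
    valid = filter_valid(candidates, toy_sentiment)
    print(f"kept {len(valid)} of {len(candidates)} candidates")
```

The third step in the abstract, expanding validated sentences into templates, is sketched separately after the related-papers list below.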
Related papers
- VALTEST: Automated Validation of Language Model Generated Test Cases [0.7059472280274008]
Large Language Models (LLMs) have demonstrated significant potential in automating software testing, specifically in generating unit test cases.
This paper introduces VALTEST, a novel framework designed to automatically validate test cases generated by LLMs by leveraging token probabilities.
arXiv Detail & Related papers (2024-11-13T00:07:32Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- Observation-based unit test generation at Meta [52.4716552057909]
TestGen automatically generates unit tests carved from serialized observations of complex objects observed during app execution.
TestGen has landed 518 tests into production, which have been executed 9,617,349 times in continuous integration, finding 5,702 faults.
Our evaluation reveals that, when carving its observations from 4,361 reliable end-to-end tests, TestGen was able to generate tests for at least 86% of the classes covered by end-to-end tests.
arXiv Detail & Related papers (2024-02-09T00:34:39Z)
- Unit Test Generation using Generative AI: A Comparative Performance Analysis of Autogeneration Tools [2.0686733932673604]
This research aims to experimentally investigate the effectiveness of Large Language Models (LLMs) for generating unit test scripts for Python programs.
For experiments, we consider three types of code units: 1) Procedural scripts, 2) Function-based modular code, and 3) Class-based code.
Our results show that ChatGPT's performance is comparable to Pynguin's in terms of coverage, and in some cases superior.
arXiv Detail & Related papers (2023-12-17T06:38:11Z)
- Can You Rely on Your Model Evaluation? Improving Model Evaluation with Synthetic Test Data [75.20035991513564]
We introduce 3S Testing, a deep generative modeling framework to facilitate model evaluation.
Our experiments demonstrate that 3S Testing outperforms traditional baselines.
These results raise the question of whether we need a paradigm shift away from limited real test data towards synthetic test data.
arXiv Detail & Related papers (2023-10-25T10:18:44Z)
- Beyond Testers' Biases: Guiding Model Testing with Knowledge Bases using LLMs [30.024465480783835]
We propose Weaver, an interactive tool that supports requirements elicitation for guiding model testing.
Weaver uses large language models to generate knowledge bases and recommends concepts from them interactively, allowing testers to elicit requirements for further testing.
arXiv Detail & Related papers (2023-10-14T21:24:03Z)
- Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models [4.318319522015101]
Existing approaches produce test cases that either qualify as simple (e.g. unit tests) or require precise specifications.
Most testing procedures still rely on test cases written by humans to form test suites.
We investigate the feasibility of performing this generation by leveraging large language models (LLMs) and using bug reports as inputs.
arXiv Detail & Related papers (2023-10-10T05:30:12Z)
- Towards Automatic Generation of Amplified Regression Test Oracles [44.45138073080198]
We propose a test oracle derivation approach to amplify regression test oracles.
The approach monitors the object state during test execution and compares it to the previous version to detect any changes in relation to the SUT's intended behaviour.
arXiv Detail & Related papers (2023-07-28T12:38:44Z)
- No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation [11.009117714870527]
Unit testing is essential in detecting bugs in functionally-discrete program units.
Recent work has shown the large potential of large language models (LLMs) in unit test generation.
It remains unclear how effective ChatGPT is in unit test generation.
arXiv Detail & Related papers (2023-05-07T07:17:08Z)
- BiasTestGPT: Using ChatGPT for Social Bias Testing of Language Models [73.29106813131818]
Bias testing is currently cumbersome because test sentences are generated from a limited set of manual templates or require expensive crowd-sourcing.
We propose using ChatGPT for the controllable generation of test sentences, given any arbitrary user-specified combination of social groups and attributes.
We present an open-source comprehensive bias testing framework (BiasTestGPT), hosted on HuggingFace, that can be plugged into any open-source PLM for bias testing.
arXiv Detail & Related papers (2023-02-14T22:07:57Z)
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation (a generic sketch of such template-based testing follows this list).
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
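Both TestAug's expansion step and CheckList's behavioral tests rest on the same template-and-lexicon idea, so the following is a minimal, generic sketch of it in plain Python. It deliberately avoids the checklist package's own API; the template, lexicons, and helper name are illustrative assumptions.

```python
# Generic template-expansion sketch (not the checklist package's API):
# one template plus small lexicons yields many labelled test cases.

from itertools import product
from typing import Dict, List, Tuple


def expand_template(template: str,
                    lexicons: Dict[str, List[str]],
                    label: str) -> List[Tuple[str, str]]:
    """Fill every slot in `template` with every combination of lexicon
    entries and attach the expected label to each generated sentence."""
    slots = list(lexicons)
    cases = []
    for values in product(*(lexicons[s] for s in slots)):
        sentence = template.format(**dict(zip(slots, values)))
        cases.append((sentence, label))
    return cases


if __name__ == "__main__":
    # A validated negation sentence turned into a template with two slots.
    template = "I did not {verb} the {product} at all."
    lexicons = {
        "verb": ["enjoy", "like", "recommend"],
        "product": ["movie", "book", "restaurant"],
    }
    tests = expand_template(template, lexicons, label="negative")
    print(f"{len(tests)} test cases, e.g. {tests[0][0]!r}")
```

Running a model over the expanded sentences and comparing its predictions to the attached label is essentially the minimum-functionality-test pattern described in the CheckList paper.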
This list is automatically generated from the titles and abstracts of the papers on this site.