White-box Testing of NLP models with Mask Neuron Coverage
- URL: http://arxiv.org/abs/2205.05050v1
- Date: Tue, 10 May 2022 17:07:23 GMT
- Title: White-box Testing of NLP models with Mask Neuron Coverage
- Authors: Arshdeep Sekhon, Yangfeng Ji, Matthew B. Dwyer, Yanjun Qi
- Abstract summary: We propose a set of white-box testing methods that are customized for transformer-based NLP models.
MNCOVER measures how thoroughly the attention layers in models are exercised during testing.
We show how MNCOVER can be used to guide CheckList input generation, evaluate alternative NLP testing methods, and drive data augmentation to improve accuracy.
- Score: 30.508750085817717
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent literature has seen growing interest in using black-box strategies
like CheckList for testing the behavior of NLP models. Research on white-box
testing has developed a number of methods for evaluating how thoroughly the
internal behavior of deep models is tested, but they are not applicable to NLP
models. We propose a set of white-box testing methods that are customized for
transformer-based NLP models. These include Mask Neuron Coverage (MNCOVER) that
measures how thoroughly the attention layers in models are exercised during
testing. We show that MNCOVER can refine testing suites generated by CheckList
by substantially reducing them in size, by more than 60% on average, while
retaining failing tests -- thereby concentrating the fault detection power of
the test suite. Further, we show how MNCOVER can be used to guide CheckList
input generation, evaluate alternative NLP testing methods, and drive data
augmentation to improve accuracy.
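The suite-reduction idea in the abstract can be illustrated with a minimal sketch of coverage-guided test filtering. This listing does not give the actual MNCOVER definition, so the sketch makes loud assumptions: each test is represented by a hypothetical set of integer "unit" ids that it exercises (standing in for whatever attention-layer units the paper tracks), and a test is kept only if it exercises at least one unit not yet covered. The function name `greedy_coverage_filter` and the toy data are illustrative, not taken from the paper.

```python
from typing import Dict, List, Set


def greedy_coverage_filter(coverage: Dict[str, Set[int]]) -> List[str]:
    """Keep a test only if it covers at least one unit not covered so far.

    `coverage` maps a test id to the set of (hypothetical) unit ids that
    the test exercises. Tests are visited in insertion order.
    """
    covered: Set[int] = set()
    kept: List[str] = []
    for test_id, units in coverage.items():
        # A test adds value if its units are not a subset of what is covered.
        if not units <= covered:
            kept.append(test_id)
            covered |= units
    return kept


# Toy data (assumed for illustration): five tests over five units.
coverage = {
    "t1": {0, 1, 2},
    "t2": {1, 2},   # redundant after t1
    "t3": {2, 3},   # adds unit 3
    "t4": {0, 3},   # redundant after t1 and t3
    "t5": {4},      # adds unit 4
}

kept = greedy_coverage_filter(coverage)
reduction = 1 - len(kept) / len(coverage)
print(kept, f"reduction={reduction:.0%}")
```

On this toy input the filter keeps three of the five tests while covering the same units as the full suite. Whether failing tests survive such filtering in practice depends on how coverage is defined, which is the question the paper studies.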
Related papers
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- GIST: Generated Inputs Sets Transferability in Deep Learning [12.147546375400749]
GIST (Generated Inputs Sets Transferability) is a novel approach for the efficient transfer of test sets.
arXiv Detail & Related papers (2023-11-01T19:35:18Z)
- Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUT).
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
- Statistical and Computational Phase Transitions in Group Testing [73.55361918807883]
We study the group testing problem where the goal is to identify a set of k infected individuals carrying a rare disease.
We consider two different simple random procedures for assigning individuals to tests.
arXiv Detail & Related papers (2022-06-15T16:38:50Z)
- TTAPS: Test-Time Adaption by Aligning Prototypes using Self-Supervision [70.05605071885914]
We propose a novel modification of the self-supervised training algorithm SwAV that adds the ability to adapt to single test samples.
We show the success of our method on the common benchmark dataset CIFAR10-C.
arXiv Detail & Related papers (2022-05-18T05:43:06Z)
- Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
- Testing Monotonicity of Machine Learning Models [0.5330240017302619]
We propose verification-based testing of monotonicity, i.e., the formal computation of test inputs on a white-box model via verification technology.
On the white-box model, the space of test inputs can be systematically explored by a directed computation of test cases.
The empirical evaluation on 90 black-box models shows verification-based testing can outperform adaptive random testing as well as property-based techniques with respect to effectiveness and efficiency.
arXiv Detail & Related papers (2020-02-27T17:38:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.