AEON: A Method for Automatic Evaluation of NLP Test Cases
- URL: http://arxiv.org/abs/2205.06439v1
- Date: Fri, 13 May 2022 03:47:13 GMT
- Authors: Jen-tse Huang, Jianping Zhang, Wenxuan Wang, Pinjia He, Yuxin Su,
Michael R. Lyu
- Abstract summary: We use AEON to evaluate test cases generated by four popular testing techniques on five datasets across three typical NLP tasks.
AEON achieves the best average precision in detecting semantically inconsistent test cases, outperforming the best baseline metric by 10%.
AEON also has the highest average precision in finding unnatural test cases, surpassing the baselines by more than 15%.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the labor-intensive nature of manual test oracle construction, various
automated testing techniques have been proposed to enhance the reliability of
Natural Language Processing (NLP) software. In theory, these techniques mutate
an existing test case (e.g., a sentence with its label) and assume the
generated one preserves an equivalent or similar semantic meaning and thus the
same label. However, in practice, many of the generated test cases fail to
preserve similar semantic meaning and are unnatural (e.g., grammatical errors),
which leads to a high false alarm rate and unnatural test cases. Our evaluation
study finds that 44% of the test cases generated by the state-of-the-art (SOTA)
approaches are false alarms. These test cases require extensive manual checking
effort, and instead of improving NLP software, they can even degrade NLP
software when utilized in model training. To address this problem, we propose
AEON for Automatic Evaluation Of NLP test cases. For each generated test case,
it outputs scores based on semantic similarity and language naturalness. We
employ AEON to evaluate test cases generated by four popular testing techniques
on five datasets across three typical NLP tasks. The results show that AEON
aligns the best with human judgment. In particular, AEON achieves the best
average precision in detecting semantically inconsistent test cases, outperforming
the best baseline metric by 10%. In addition, AEON also has the highest average
precision in finding unnatural test cases, surpassing the baselines by more
than 15%. Moreover, model training with test cases prioritized by AEON leads to
models that are more accurate and robust, demonstrating AEON's potential in
improving NLP software.
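
The scoring idea in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a sentence-embedding model (via sentence-transformers) as the semantic-similarity signal and inverse GPT-2 perplexity as the naturalness signal; the model choices and the way the two scores would be combined are assumptions for illustration only.

# Minimal sketch of AEON-style test-case scoring (illustrative, not the paper's code).
# Assumption: cosine similarity of sentence embeddings approximates semantic
# consistency, and inverse GPT-2 perplexity approximates language naturalness.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

embedder = SentenceTransformer("all-MiniLM-L6-v2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")

def semantic_similarity(seed: str, mutant: str) -> float:
    # Cosine similarity between sentence embeddings (higher = closer meaning).
    emb = embedder.encode([seed, mutant], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def naturalness(sentence: str) -> float:
    # Inverse perplexity under GPT-2 (higher = more natural text).
    ids = lm_tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token-level cross-entropy
    return 1.0 / torch.exp(loss).item()

seed = "The movie was surprisingly good."
mutant = "The film was surprisingly good."  # a mutation assumed to keep the label
print(semantic_similarity(seed, mutant), naturalness(mutant))

A testing pipeline could then flag generated test cases whose similarity or naturalness score falls below a tuned threshold, which is the false-alarm filtering role the abstract describes.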
Related papers
- Precise Error Rates for Computationally Efficient Testing [75.63895690909241]
We revisit the question of simple-versus-simple hypothesis testing with an eye towards computational complexity.
An existing test based on linear spectral statistics achieves the best possible tradeoff curve between type I and type II error rates.
arXiv Detail & Related papers (2023-11-01T04:41:16Z)
- Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUT).
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
- Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing Perspective [63.92197404447808]
Large language models (LLMs) have shown some human-like cognitive abilities.
We propose an adaptive testing framework for LLM evaluation.
This approach dynamically adjusts the characteristics of the test questions, such as difficulty, based on the model's performance.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Intergenerational Test Generation for Natural Language Processing Applications [16.63835131985415]
We propose an automated test generation method for detecting erroneous behaviors of various NLP applications.
We implement this method into NLPLego, which is designed to fully exploit the potential of seed sentences.
NLPLego successfully detects 1,732, 5,301, and 261,879 incorrect behaviors with around 95.7% precision in three tasks.
arXiv Detail & Related papers (2023-02-21T07:57:59Z)
- TTAPS: Test-Time Adaption by Aligning Prototypes using Self-Supervision [70.05605071885914]
We propose a novel modification of the self-supervised training algorithm SwAV that adds the ability to adapt to single test samples.
We show the success of our method on the common benchmark dataset CIFAR10-C.
arXiv Detail & Related papers (2022-05-18T05:43:06Z)
- Labeling-Free Comparison Testing of Deep Learning Models [28.47632100019289]
We propose a labeling-free comparison testing approach to overcome the limitations of labeling effort and sampling randomness.
Our approach outperforms the baseline methods by up to 0.74 and 0.53 on Spearman's correlation and Kendall's $\tau$, regardless of the dataset and distribution shift.
arXiv Detail & Related papers (2022-04-08T10:55:45Z)
- TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning Tasks [14.547623982073475]
Deep learning systems are notoriously difficult to test and debug.
To reduce test cost, it is essential to select and label only "high quality" bug-revealing test inputs.
We propose TestRank, a novel test prioritization technique that orders unlabeled test instances according to their bug-revealing capabilities.
arXiv Detail & Related papers (2021-05-21T03:41:10Z)
- Active Testing: Sample-Efficient Model Evaluation [39.200332879659456]
We introduce active testing: a new framework for sample-efficient model evaluation.
Active testing addresses the labeling cost of model evaluation by carefully selecting the test points to label.
Naive active selection introduces bias; we show how to remove that bias while reducing the variance of the estimator.
arXiv Detail & Related papers (2021-03-09T10:20:49Z)
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
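
For context on the Dorfman result cited in the last entry: with infection prevalence p and pools of size g, one pooled test is run per group, and all g members are retested individually only if the pool is positive, so the expected number of tests per person is 1/g + 1 - (1-p)^g. A small sketch of this calculation follows; the 1% prevalence figure is an illustrative assumption, not taken from the paper.

# Dorfman's classic (noise-free) group-testing arithmetic.
# Expected tests per person for pool size g and prevalence p:
#   E(g) = 1/g + 1 - (1 - p)**g
def expected_tests_per_person(p: float, g: int) -> float:
    # One pooled test per group, plus g retests when the pool is positive.
    return 1.0 / g + 1.0 - (1.0 - p) ** g

p = 0.01  # illustrative 1% prevalence
best_g = min(range(2, 51), key=lambda g: expected_tests_per_person(p, g))
print(best_g, round(expected_tests_per_person(p, best_g), 4))  # 11, 0.1956

Work in this line, such as the Bayesian sequential design above, generalizes this calculation to noisy tests and adaptive pool choices.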