AEON: A Method for Automatic Evaluation of NLP Test Cases
- URL: http://arxiv.org/abs/2205.06439v1
- Date: Fri, 13 May 2022 03:47:13 GMT
- Title: AEON: A Method for Automatic Evaluation of NLP Test Cases
- Authors: Jen-tse Huang, Jianping Zhang, Wenxuan Wang, Pinjia He, Yuxin Su,
Michael R. Lyu
- Abstract summary: We use AEON to evaluate test cases generated by four popular testing techniques on five datasets across three typical NLP tasks.
AEON achieves the best average precision in detecting semantically inconsistent test cases, outperforming the best baseline metric by 10%.
AEON also has the highest average precision in finding unnatural test cases, surpassing the baselines by more than 15%.
- Score: 37.71980769922552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the labor-intensive nature of manual test oracle construction, various
automated testing techniques have been proposed to enhance the reliability of
Natural Language Processing (NLP) software. In theory, these techniques mutate
an existing test case (e.g., a sentence with its label) and assume the
generated one preserves an equivalent or similar semantic meaning and thus, the
same label. However, in practice, many of the generated test cases fail to
preserve similar semantic meaning and are unnatural (e.g., they contain
grammatical errors), which leads to a high false alarm rate and unnatural
test cases. Our evaluation
study finds that 44% of the test cases generated by the state-of-the-art (SOTA)
approaches are false alarms. These test cases require extensive manual checking
effort and, rather than improving NLP software, can even degrade it when used
in model training. To address this problem, we propose
AEON for Automatic Evaluation Of NLP test cases. For each generated test case,
it outputs scores based on semantic similarity and language naturalness. We
employ AEON to evaluate test cases generated by four popular testing techniques
on five datasets across three typical NLP tasks. The results show that AEON
aligns the best with human judgment. In particular, AEON achieves the best
average precision in detecting semantically inconsistent test cases,
outperforming the best baseline metric by 10%. In addition, AEON has the
highest average precision in finding unnatural test cases, surpassing the baselines by more
than 15%. Moreover, model training with test cases prioritized by AEON leads to
models that are more accurate and robust, demonstrating AEON's potential in
improving NLP software.
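As a concrete illustration of the two scores described above, the following Python sketch approximates AEON-style scoring under stated assumptions; it is not the authors' implementation. Semantic similarity is stood in for by sentence-embedding cosine similarity (using the all-MiniLM-L6-v2 model from sentence-transformers) and naturalness by inverse GPT-2 perplexity; the model choices and example sentences are assumptions for demonstration only.

# Illustrative sketch only (not the authors' implementation): semantic
# similarity is approximated with sentence-embedding cosine similarity and
# naturalness with inverse GPT-2 perplexity. Model names and the example
# sentences are assumptions for demonstration.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # assumed language model
lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")

def semantic_similarity(seed: str, mutant: str) -> float:
    # Cosine similarity between embeddings of the seed and the mutated test case.
    emb = embedder.encode([seed, mutant], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def naturalness(sentence: str) -> float:
    # Inverse perplexity under GPT-2; higher values suggest more natural text.
    ids = lm_tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token-level cross-entropy
    return 1.0 / torch.exp(loss).item()

def score_test_case(seed: str, mutant: str) -> dict:
    # One score per quality dimension, mirroring the two-score output
    # described in the abstract.
    return {"semantic_similarity": semantic_similarity(seed, mutant),
            "naturalness": naturalness(mutant)}

print(score_test_case("The movie was great.", "The movie was gr8."))

In such a sketch, generated test cases scoring low on either dimension would be candidates for filtering or deprioritization before manual inspection or model training, in the spirit of the prioritization result reported in the abstract.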
Related papers
- From Requirements to Test Cases: An NLP-Based Approach for High-Performance ECU Test Case Automation [0.5249805590164901]
This study investigates the use of Natural Language Processing techniques to transform natural language requirements into structured test case specifications.
A dataset of 400 feature element documents was used to evaluate both approaches (a Rule-Based method and an NER method) for extracting key elements such as signal names and values.
The Rule-Based method outperforms the NER method, achieving 95% accuracy for more straightforward requirements with single signals.
arXiv Detail & Related papers (2025-05-01T14:23:55Z) - AutoTestForge: A Multidimensional Automated Testing Framework for Natural Language Processing Models [11.958545255487735]
We introduce AutoTestForge, an automated and multidimensional testing framework for NLP models.
Within AutoTestForge, Large Language Models (LLMs) are used to automatically generate and instantiate test templates, significantly reducing manual involvement.
The framework also extends the test suite across three dimensions (taxonomy, fairness, and robustness), offering a comprehensive evaluation of the capabilities of NLP models.
arXiv Detail & Related papers (2025-03-07T02:44:17Z) - ABFS: Natural Robustness Testing for LLM-based NLP Software [8.833542944724465]
The use of Large Language Models (LLMs) in Natural Language Processing (NLP) software has rapidly gained traction across various domains.
These applications frequently exhibit robustness deficiencies, where slight perturbations in input may lead to erroneous outputs.
Current robustness testing methods face two main limitations: (1) low testing effectiveness, and (2) insufficient naturalness of test cases.
arXiv Detail & Related papers (2025-03-03T09:02:06Z) - VALTEST: Automated Validation of Language Model Generated Test Cases [0.7059472280274008]
Large Language Models (LLMs) have demonstrated significant potential in automating software testing, specifically in generating unit test cases.
This paper introduces VALTEST, a novel framework designed to automatically validate test cases generated by LLMs by leveraging token probabilities.
arXiv Detail & Related papers (2024-11-13T00:07:32Z) - Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT), which uses context as an inductive bias to guide the search for meaningful model failures.
We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z) - SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists [59.08999823652293]
We propose SYNTHEVAL to generate a wide range of test types for a comprehensive evaluation of NLP models.
In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the task-specific models consistently exhibit.
We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.
arXiv Detail & Related papers (2024-08-30T17:41:30Z) - Effective Test Generation Using Pre-trained Large Language Models and
Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Programs Under Test (PUTs).
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Intergenerational Test Generation for Natural Language Processing
Applications [16.63835131985415]
We propose an automated test generation method for detecting erroneous behaviors of various NLP applications.
We implement this method in NLPLego, which is designed to fully exploit the potential of seed sentences.
NLPLego successfully detects 1,732, 5,301, and 261,879 incorrect behaviors with around 95.7% precision in three tasks.
arXiv Detail & Related papers (2023-02-21T07:57:59Z) - TTAPS: Test-Time Adaption by Aligning Prototypes using Self-Supervision [70.05605071885914]
We propose a novel modification of the self-supervised training algorithm SwAV that adds the ability to adapt to single test samples.
We show the success of our method on the common benchmark dataset CIFAR10-C.
arXiv Detail & Related papers (2022-05-18T05:43:06Z) - Labeling-Free Comparison Testing of Deep Learning Models [28.47632100019289]
We propose a labeling-free comparison testing approach to overcome the limitations of labeling effort and sampling randomness.
Our approach outperforms the baseline methods by up to 0.74 and 0.53 on Spearman's correlation and Kendall's $\tau$, regardless of the dataset and distribution shift.
arXiv Detail & Related papers (2022-04-08T10:55:45Z) - TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning
Tasks [14.547623982073475]
Deep learning systems are notoriously difficult to test and debug.
It is essential to conduct test selection and label only those selected "high quality" bug-revealing test inputs for test cost reduction.
We propose a novel test prioritization technique that brings order into the unlabeled test instances according to their bug-revealing capabilities, namely TestRank.
arXiv Detail & Related papers (2021-05-21T03:41:10Z) - Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z) - Noisy Adaptive Group Testing using Bayesian Sequential Experimental
Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)