LEAP: Efficient and Automated Test Method for NLP Software
- URL: http://arxiv.org/abs/2308.11284v1
- Date: Tue, 22 Aug 2023 08:51:10 GMT
- Title: LEAP: Efficient and Automated Test Method for NLP Software
- Authors: Mingxuan Xiao, Yan Xiao, Hai Dong, Shunhui Ji, Pengcheng Zhang
- Abstract summary: This paper proposes LEAP, an automated test method that uses LEvy flight-based Adaptive Particle swarm optimization integrated with textual features to generate adversarial test cases.
We conducted a series of experiments to validate LEAP's ability to test NLP software and found that the average success rate of LEAP in generating adversarial test cases is 79.1%.
While ensuring high success rates, LEAP significantly reduces time overhead by up to 147.6s compared to other heuristic-based methods.
- Score: 6.439196068684973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread adoption of DNNs in NLP software has highlighted the need for
robustness. Researchers proposed various automatic testing techniques for
adversarial test cases. However, existing methods suffer from two limitations:
weak error-discovering capabilities, with success rates ranging from 0% to
24.6% for BERT-based NLP software, and time inefficiency, taking 177.8s to
205.28s per test case, making them challenging for time-constrained scenarios.
To address these issues, this paper proposes LEAP, an automated test method
that uses LEvy flight-based Adaptive Particle swarm optimization integrated
with textual features to generate adversarial test cases. Specifically, we
adopt Levy flight for population initialization to increase the diversity of
generated test cases. We also design an inertial weight adaptive update
operator to improve the efficiency of LEAP's global optimization of
high-dimensional text examples and a mutation operator based on the greedy
strategy to reduce the search time. We conducted a series of experiments to
validate LEAP's ability to test NLP software and found that the average success
rate of LEAP in generating adversarial test cases is 79.1%, which is 6.1%
higher than the next best approach (PSOattack). While ensuring high success
rates, LEAP significantly reduces time overhead by up to 147.6s compared to
other heuristic-based methods. Additionally, the experimental results
demonstrate that LEAP can generate more transferable test cases and
significantly enhance the robustness of DNN-based systems.
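For illustration only, here is a minimal, self-contained sketch of the kind of search loop the abstract describes: Levy-flight-based initialization, an adaptive inertia weight, and a greedy mutation step over candidate perturbations. The `victim_confidence` stub, the continuous encoding of word substitutions, and the specific adaptive-weight rule are assumptions made for the sketch, not LEAP's actual implementation.

```python
# Minimal sketch of a LEAP-style search loop (illustrative, not the authors' code).
# Assumptions: candidates are continuous vectors later mapped to word-substitution
# choices, and `victim_confidence` stands in for querying the NLP model under test.
import math
import numpy as np

rng = np.random.default_rng(0)

def levy_step(size, beta=1.5):
    """Heavy-tailed Levy-flight steps via Mantegna's algorithm."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)

def victim_confidence(candidate):
    """Placeholder: confidence of the model under test on the original label."""
    return 1.0 / (1.0 + np.abs(candidate).sum())

def fitness(candidate):
    # Lower confidence on the original label means a better adversarial test case.
    return 1.0 - victim_confidence(candidate)

n_words, n_particles, n_iters = 20, 30, 50
pos = levy_step((n_particles, n_words))     # Levy-flight population initialization
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

w_max, w_min, c1, c2 = 0.9, 0.4, 2.0, 2.0
for _ in range(n_iters):
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

    # Adaptive inertia weight: particles far from the best keep exploring,
    # nearby ones exploit (one simple adaptive rule, not LEAP's exact operator).
    dist = np.linalg.norm(pos - gbest, axis=1, keepdims=True)
    w = w_min + (w_max - w_min) * dist / (dist.max() + 1e-9)

    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel

    # Greedy mutation: accept a per-dimension perturbation of the incumbent best
    # only if it improves fitness, which keeps the refinement step cheap.
    for d in range(n_words):
        trial = gbest.copy()
        trial[d] += rng.normal()
        if fitness(trial) > fitness(gbest):
            gbest = trial

print("best adversarial fitness:", round(fitness(gbest), 4))
```

The point of the sketch is the control flow the abstract names: heavy-tailed initialization for diversity, an inertia weight that adapts per particle, and a cheap greedy refinement of the incumbent best candidate.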
Related papers
- Automated Robustness Testing for LLM-based NLP Software [6.986328098563149]
There are no known automated robustness testing methods specifically designed for LLM-based NLP software.
Existing testing methods can be applied to LLM-based software through AORTA, but their effectiveness is limited.
We propose a novel testing method for LLM-based software within AORTA called Adaptive Beam Search.
arXiv Detail & Related papers (2024-12-30T15:33:34Z)
- PromptV: Leveraging LLM-powered Multi-Agent Prompting for High-quality Verilog Generation [9.990225157705966]
This paper proposes a novel multi-agent prompt learning framework to address the limitations of existing approaches and enhance code generation quality.
We show for the first time that multi-agent architectures can effectively mitigate the degeneration risk while improving code error correction capabilities.
arXiv Detail & Related papers (2024-12-15T01:58:10Z)
- Adaptive Learn-then-Test: Statistically Valid and Efficient Hyperparameter Selection [36.407171992845456]
We introduce adaptive learn-then-test (aLTT) to provide finite-sample statistical guarantees on the population risk of AI models.
Unlike the existing learn-then-test (LTT) technique, aLTT implements sequential data-dependent multiple hypothesis testing (MHT) with early termination by leveraging e-processes.
arXiv Detail & Related papers (2024-09-24T08:14:26Z)
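As a rough illustration of the e-process idea in the aLTT entry above, the sketch below runs one test-by-betting e-process per candidate hyperparameter on bounded losses and stops collecting data for a candidate as soon as it is certified. The wealth update, the betting fraction `nu`, the Bonferroni-style threshold, and the simulated `true_risk` values are assumptions for the sketch, not the paper's exact procedure.

```python
# Illustrative e-process sketch (assumptions: 0/1 losses, a fixed betting
# fraction, and a Bonferroni-style threshold; not aLTT's exact procedure).
import numpy as np

rng = np.random.default_rng(1)

alpha = 0.2                                  # risk level to certify: R(lambda) <= alpha
delta = 0.05                                 # tolerated family-wise error probability
true_risk = np.array([0.05, 0.12, 0.30, 0.45])   # unknown in practice; simulated here
K = len(true_risk)
nu = 0.5 / (1 - alpha)                       # betting fraction; keeps wealth positive
wealth = np.ones(K)                          # one e-process per candidate hyperparameter
active = np.ones(K, dtype=bool)
certified = np.zeros(K, dtype=bool)
threshold = K / delta                        # Ville's inequality + union bound => FWER <= delta

for t in range(1, 5001):
    if not active.any():
        break                                # stop as soon as every candidate is certified
    losses = (rng.random(K) < true_risk).astype(float)   # one fresh held-out loss each
    # Under H0 (risk > alpha) each factor has expectation <= 1, so wealth is a supermartingale.
    wealth[active] *= 1 + nu * (alpha - losses[active])
    newly = active & (wealth >= threshold)
    certified |= newly
    active &= ~newly                         # early termination for certified candidates

print("certified candidates:", np.flatnonzero(certified).tolist(), "after", t, "rounds")
```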
- Skill-Adaptive Imitation Learning for UI Test Reuse [13.538724823517292]
We propose a skill-adaptive imitation learning framework designed to enhance the effectiveness of UI test migration.
Results show that SAIL substantially improves the effectiveness of UI test migration, with a 149% higher success rate than state-of-the-art approaches.
arXiv Detail & Related papers (2024-09-20T08:13:04Z)
- On Speeding Up Language Model Evaluation [48.51924035873411]
Development of prompt-based methods with Large Language Models (LLMs) requires making numerous decisions.
We propose a novel method to address this challenge.
We show that it can identify the top-performing method using only 5-15% of the typically needed resources.
arXiv Detail & Related papers (2024-07-08T17:48:42Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
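As a loose illustration of the per-sample mechanism in the UAL entry above, the sketch below computes a cross-entropy loss whose label-smoothing value grows with each sample's estimated uncertainty. The linear `eps_max` mapping and the toy inputs are assumptions for the sketch, not the paper's recipe.

```python
# Sketch of uncertainty-adaptive label smoothing (the eps_max mapping and toy
# inputs are assumptions for illustration, not the paper's exact recipe).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def uncertainty_aware_ce(logits, labels, uncertainty, eps_max=0.3):
    """Cross-entropy against per-sample smoothed targets.

    logits:      (N, C) model outputs
    labels:      (N,) integer class labels
    uncertainty: (N,) scores in [0, 1], e.g. from an ensemble or a reward model
    """
    n, c = logits.shape
    eps = eps_max * uncertainty                       # per-sample smoothing value
    targets = np.broadcast_to((eps / (c - 1))[:, None], (n, c)).copy()
    targets[np.arange(n), labels] = 1.0 - eps         # rows still sum to 1
    logp = np.log(softmax(logits) + 1e-12)
    return -(targets * logp).sum(axis=1).mean()

# The uncertain second sample (0.9) is trained against a much softer target.
logits = np.array([[2.0, 0.1, -1.0], [0.3, 0.2, 0.1]])
labels = np.array([0, 2])
print(uncertainty_aware_ce(logits, labels, np.array([0.1, 0.9])))
```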
- Active Test-Time Adaptation: Theoretical Analyses and An Algorithm [51.84691955495693]
Test-time adaptation (TTA) addresses distribution shifts for streaming test data in unsupervised settings.
We propose the novel problem setting of active test-time adaptation (ATTA), which integrates active learning within the fully test-time adaptation setting.
arXiv Detail & Related papers (2024-04-07T22:31:34Z)
- Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUT).
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
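As a small illustration of the mutation-testing loop the MuTAP entry above builds on, the sketch below flips arithmetic operators in a toy program under test, re-runs the tests against each mutant, and reports the survivors that a MuTAP-style pipeline would feed back to the LLM. The program, the single weak test, and the operator swaps are toy assumptions, not MuTAP's implementation.

```python
# Toy mutation-testing loop (the program under test, the single weak test, and
# the operator swaps are illustrative assumptions, not MuTAP's implementation).
import ast
import copy
import types

def make_mutants(source):
    """Yield mutated sources with one arithmetic operator flipped."""
    swaps = {ast.Mult: ast.Add, ast.Add: ast.Sub, ast.Sub: ast.Add}
    tree = ast.parse(source)
    for i, node in enumerate(ast.walk(tree)):
        if isinstance(node, ast.BinOp) and type(node.op) in swaps:
            mutant = copy.deepcopy(tree)
            for j, mnode in enumerate(ast.walk(mutant)):
                if j == i:                     # same position in the copied tree
                    mnode.op = swaps[type(node.op)]()
            yield ast.unparse(mutant)

def passes(source, tests):
    """True if every test passes against the module built from `source`."""
    module = types.ModuleType("put")
    exec(compile(source, "<put>", "exec"), module.__dict__)
    try:
        for test in tests:
            test(module)
        return True
    except AssertionError:
        return False

PUT = "def area(w, h):\n    return w * h\n"

def weak_test(m):
    assert m.area(2, 2) == 4                   # a square cannot tell w*h from w+h

survivors = [src for src in make_mutants(PUT) if passes(src, [weak_test])]
print(f"{len(survivors)} surviving mutant(s) to report back for stronger tests")
```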
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [60.58434523646137]
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency.
We introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question.
Our experiments show that Adaptive-Consistency reduces sample budget by up to 7.9 times with an average accuracy drop of less than 0.1%.
arXiv Detail & Related papers (2023-05-19T17:49:25Z)
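As a rough sketch of the stop-early sampling idea in the Adaptive-Consistency entry above, the code below draws one answer at a time and stops once a Dirichlet posterior suggests the current majority answer is unlikely to change. The `sample_answer` stub, the extra "unseen" bucket, and the 0.95 threshold are assumptions for the sketch; the paper's exact stopping criterion may differ.

```python
# Stop-early self-consistency sketch (the sample_answer stub, the extra "unseen"
# bucket, and the 0.95 threshold are assumptions; the paper's criterion may differ).
import numpy as np

rng = np.random.default_rng(7)

def sample_answer():
    """Placeholder for one chain-of-thought sample from an LLM."""
    return str(rng.choice(["42", "41", "40"], p=[0.7, 0.2, 0.1]))

def majority_is_stable(counts, threshold=0.95, draws=4000):
    """Estimate P(current top answer is the true mode) under a Dirichlet(1 + counts)
    posterior, with one extra zero-count bucket standing in for unseen answers."""
    c = np.append(np.array(list(counts.values()), dtype=float), 0.0)
    post = rng.dirichlet(c + 1.0, size=draws)
    return (post.argmax(axis=1) == c.argmax()).mean() >= threshold

counts, max_samples = {}, 40
for n in range(1, max_samples + 1):
    ans = sample_answer()
    counts[ans] = counts.get(ans, 0) + 1
    if n >= 2 and majority_is_stable(counts):
        break                                  # majority is very unlikely to change

best = max(counts, key=counts.get)
print(f"answer={best} after {n} samples (fixed self-consistency would use {max_samples})")
```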
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.