LEAP: Efficient and Automated Test Method for NLP Software
- URL: http://arxiv.org/abs/2308.11284v1
- Date: Tue, 22 Aug 2023 08:51:10 GMT
- Title: LEAP: Efficient and Automated Test Method for NLP Software
- Authors: Mingxuan Xiao, Yan Xiao, Hai Dong, Shunhui Ji, Pengcheng Zhang
- Abstract summary: This paper proposes LEAP, an automated test method that uses LEvy flight-based Adaptive Particle swarm optimization integrated with textual features to generate adversarial test cases.
We conducted a series of experiments to validate LEAP's ability to test NLP software and found that the average success rate of LEAP in generating adversarial test cases is 79.1%.
While ensuring high success rates, LEAP significantly reduces time overhead by up to 147.6s compared to other heuristic-based methods.
- Score: 6.439196068684973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread adoption of DNNs in NLP software has highlighted the need for
robustness. Researchers proposed various automatic testing techniques for
adversarial test cases. However, existing methods suffer from two limitations:
weak error-discovering capabilities, with success rates ranging from 0% to
24.6% for BERT-based NLP software, and time inefficiency, taking 177.8s to
205.28s per test case, making them challenging for time-constrained scenarios.
To address these issues, this paper proposes LEAP, an automated test method
that uses LEvy flight-based Adaptive Particle swarm optimization integrated
with textual features to generate adversarial test cases. Specifically, we
adopt Levy flight for population initialization to increase the diversity of
generated test cases. We also design an inertial weight adaptive update
operator to improve the efficiency of LEAP's global optimization of
high-dimensional text examples and a mutation operator based on the greedy
strategy to reduce the search time. We conducted a series of experiments to
validate LEAP's ability to test NLP software and found that the average success
rate of LEAP in generating adversarial test cases is 79.1%, which is 6.1%
higher than the next best approach (PSOattack). While ensuring high success
rates, LEAP significantly reduces time overhead by up to 147.6s compared to
other heuristic-based methods. Additionally, the experimental results
demonstrate that LEAP can generate more transferable test cases and
significantly enhance the robustness of DNN-based systems.
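For illustration only, here is a minimal, self-contained sketch of the kind of search loop the abstract describes: Levy-flight-based initialization, an adaptive inertia weight, and a greedy mutation step over candidate perturbations. The `victim_confidence` stub, the continuous encoding of word substitutions, and the specific adaptive-weight rule are assumptions made for the sketch, not LEAP's actual implementation.

```python
# Minimal sketch of a LEAP-style search loop (illustrative, not the authors' code).
# Assumptions: candidates are continuous vectors later mapped to word-substitution
# choices, and `victim_confidence` stands in for querying the NLP model under test.
import math
import numpy as np

rng = np.random.default_rng(0)

def levy_step(size, beta=1.5):
    """Heavy-tailed Levy-flight steps via Mantegna's algorithm."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)

def victim_confidence(candidate):
    """Placeholder: confidence of the model under test on the original label."""
    return 1.0 / (1.0 + np.abs(candidate).sum())

def fitness(candidate):
    # Lower confidence on the original label means a better adversarial test case.
    return 1.0 - victim_confidence(candidate)

n_words, n_particles, n_iters = 20, 30, 50
pos = levy_step((n_particles, n_words))     # Levy-flight population initialization
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

w_max, w_min, c1, c2 = 0.9, 0.4, 2.0, 2.0
for _ in range(n_iters):
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

    # Adaptive inertia weight: particles far from the best keep exploring,
    # nearby ones exploit (one simple adaptive rule, not LEAP's exact operator).
    dist = np.linalg.norm(pos - gbest, axis=1, keepdims=True)
    w = w_min + (w_max - w_min) * dist / (dist.max() + 1e-9)

    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel

    # Greedy mutation: accept a per-dimension perturbation of the incumbent best
    # only if it improves fitness, which keeps the refinement step cheap.
    for d in range(n_words):
        trial = gbest.copy()
        trial[d] += rng.normal()
        if fitness(trial) > fitness(gbest):
            gbest = trial

print("best adversarial fitness:", round(fitness(gbest), 4))
```

The point of the sketch is the control flow the abstract names: heavy-tailed initialization for diversity, an inertia weight that adapts per particle, and a cheap greedy refinement of the incumbent best candidate.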
Related papers
- Automated Robustness Testing for LLM-based NLP Software [6.986328098563149]
There are no known automated robustness testing methods specifically designed for LLM-based NLP software.
Existing testing methods can be applied to LLM-based software through AORTA, but their effectiveness is limited.
We propose a novel testing method for LLM-based software within AORTA called Adaptive Beam Search.
arXiv Detail & Related papers (2024-12-30T15:33:34Z)
- PromptV: Leveraging LLM-powered Multi-Agent Prompting for High-quality Verilog Generation [9.990225157705966]
This paper proposes a novel multi-agent prompt learning framework to address the limitations of existing approaches and enhance code generation quality.
We show for the first time that multi-agent architectures can effectively mitigate the degeneration risk while improving code error correction capabilities.
arXiv Detail & Related papers (2024-12-15T01:58:10Z)
- Adaptive Learn-then-Test: Statistically Valid and Efficient Hyperparameter Selection [36.407171992845456]
We introduce adaptive learn-then-test (aLTT) to provide finite-sample statistical guarantees on the population risk of AI models.
Unlike the existing learn-then-test (LTT) technique, aLTT implements sequential data-dependent multiple hypothesis testing (MHT) with early termination by leveraging e-processes.
arXiv Detail & Related papers (2024-09-24T08:14:26Z)
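As a rough illustration of the e-process idea in the aLTT entry above, the sketch below runs one test-by-betting e-process per candidate hyperparameter on bounded losses and stops collecting data for a candidate as soon as it is certified. The wealth update, the betting fraction `nu`, the Bonferroni-style threshold, and the simulated `true_risk` values are assumptions for the sketch, not the paper's exact procedure.

```python
# Illustrative e-process sketch (assumptions: 0/1 losses, a fixed betting
# fraction, and a Bonferroni-style threshold; not aLTT's exact procedure).
import numpy as np

rng = np.random.default_rng(1)

alpha = 0.2                                  # risk level to certify: R(lambda) <= alpha
delta = 0.05                                 # tolerated family-wise error probability
true_risk = np.array([0.05, 0.12, 0.30, 0.45])   # unknown in practice; simulated here
K = len(true_risk)
nu = 0.5 / (1 - alpha)                       # betting fraction; keeps wealth positive
wealth = np.ones(K)                          # one e-process per candidate hyperparameter
active = np.ones(K, dtype=bool)
certified = np.zeros(K, dtype=bool)
threshold = K / delta                        # Ville's inequality + union bound => FWER <= delta

for t in range(1, 5001):
    if not active.any():
        break                                # stop as soon as every candidate is certified
    losses = (rng.random(K) < true_risk).astype(float)   # one fresh held-out loss each
    # Under H0 (risk > alpha) each factor has expectation <= 1, so wealth is a supermartingale.
    wealth[active] *= 1 + nu * (alpha - losses[active])
    newly = active & (wealth >= threshold)
    certified |= newly
    active &= ~newly                         # early termination for certified candidates

print("certified candidates:", np.flatnonzero(certified).tolist(), "after", t, "rounds")
```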
- Skill-Adaptive Imitation Learning for UI Test Reuse [13.538724823517292]
We propose a skill-adaptive imitation learning framework designed to enhance the effectiveness of UI test migration.
Results show that SAIL substantially improves the effectiveness of UI test migration, with a 149% higher success rate than state-of-the-art approaches.
arXiv Detail & Related papers (2024-09-20T08:13:04Z)
- On Speeding Up Language Model Evaluation [48.51924035873411]
Development of prompt-based methods with Large Language Models (LLMs) requires making numerous decisions.
We propose a novel method to address this challenge.
We show that it can identify the top-performing method using only 5-15% of the typically needed resources.
arXiv Detail & Related papers (2024-07-08T17:48:42Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
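As a loose illustration of the per-sample mechanism in the UAL entry above, the sketch below computes a cross-entropy loss whose label-smoothing value grows with each sample's estimated uncertainty. The linear `eps_max` mapping and the toy inputs are assumptions for the sketch, not the paper's recipe.

```python
# Sketch of uncertainty-adaptive label smoothing (the eps_max mapping and toy
# inputs are assumptions for illustration, not the paper's exact recipe).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def uncertainty_aware_ce(logits, labels, uncertainty, eps_max=0.3):
    """Cross-entropy against per-sample smoothed targets.

    logits:      (N, C) model outputs
    labels:      (N,) integer class labels
    uncertainty: (N,) scores in [0, 1], e.g. from an ensemble or a reward model
    """
    n, c = logits.shape
    eps = eps_max * uncertainty                       # per-sample smoothing value
    targets = np.broadcast_to((eps / (c - 1))[:, None], (n, c)).copy()
    targets[np.arange(n), labels] = 1.0 - eps         # rows still sum to 1
    logp = np.log(softmax(logits) + 1e-12)
    return -(targets * logp).sum(axis=1).mean()

# The uncertain second sample (0.9) is trained against a much softer target.
logits = np.array([[2.0, 0.1, -1.0], [0.3, 0.2, 0.1]])
labels = np.array([0, 2])
print(uncertainty_aware_ce(logits, labels, np.array([0.1, 0.9])))
```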
- Active Test-Time Adaptation: Theoretical Analyses and An Algorithm [51.84691955495693]
Test-time adaptation (TTA) addresses distribution shifts for streaming test data in unsupervised settings.
We propose the novel problem setting of active test-time adaptation (ATTA), which integrates active learning within the fully test-time adaptation setting.
arXiv Detail & Related papers (2024-04-07T22:31:34Z)
- Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing [13.743062498008555]
We introduce MuTAP for improving the effectiveness of test cases generated by Large Language Models (LLMs) in terms of revealing bugs.
MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Program Under Test (PUT).
Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets.
arXiv Detail & Related papers (2023-08-31T08:48:31Z)
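As a small illustration of the mutation-testing loop the MuTAP entry above builds on, the sketch below flips arithmetic operators in a toy program under test, re-runs the tests against each mutant, and reports the survivors that a MuTAP-style pipeline would feed back to the LLM. The program, the single weak test, and the operator swaps are toy assumptions, not MuTAP's implementation.

```python
# Toy mutation-testing loop (the program under test, the single weak test, and
# the operator swaps are illustrative assumptions, not MuTAP's implementation).
import ast
import copy
import types

def make_mutants(source):
    """Yield mutated sources with one arithmetic operator flipped."""
    swaps = {ast.Mult: ast.Add, ast.Add: ast.Sub, ast.Sub: ast.Add}
    tree = ast.parse(source)
    for i, node in enumerate(ast.walk(tree)):
        if isinstance(node, ast.BinOp) and type(node.op) in swaps:
            mutant = copy.deepcopy(tree)
            for j, mnode in enumerate(ast.walk(mutant)):
                if j == i:                     # same position in the copied tree
                    mnode.op = swaps[type(node.op)]()
            yield ast.unparse(mutant)

def passes(source, tests):
    """True if every test passes against the module built from `source`."""
    module = types.ModuleType("put")
    exec(compile(source, "<put>", "exec"), module.__dict__)
    try:
        for test in tests:
            test(module)
        return True
    except AssertionError:
        return False

PUT = "def area(w, h):\n    return w * h\n"

def weak_test(m):
    assert m.area(2, 2) == 4                   # a square cannot tell w*h from w+h

survivors = [src for src in make_mutants(PUT) if passes(src, [weak_test])]
print(f"{len(survivors)} surviving mutant(s) to report back for stronger tests")
```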
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [60.58434523646137]
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency.
We introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question.
Our experiments show that Adaptive-Consistency reduces sample budget by up to 7.9 times with an average accuracy drop of less than 0.1%.
arXiv Detail & Related papers (2023-05-19T17:49:25Z)
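As a rough sketch of the stop-early sampling idea in the Adaptive-Consistency entry above, the code below draws one answer at a time and stops once a Dirichlet posterior suggests the current majority answer is unlikely to change. The `sample_answer` stub, the extra "unseen" bucket, and the 0.95 threshold are assumptions for the sketch; the paper's exact stopping criterion may differ.

```python
# Stop-early self-consistency sketch (the sample_answer stub, the extra "unseen"
# bucket, and the 0.95 threshold are assumptions; the paper's criterion may differ).
import numpy as np

rng = np.random.default_rng(7)

def sample_answer():
    """Placeholder for one chain-of-thought sample from an LLM."""
    return str(rng.choice(["42", "41", "40"], p=[0.7, 0.2, 0.1]))

def majority_is_stable(counts, threshold=0.95, draws=4000):
    """Estimate P(current top answer is the true mode) under a Dirichlet(1 + counts)
    posterior, with one extra zero-count bucket standing in for unseen answers."""
    c = np.append(np.array(list(counts.values()), dtype=float), 0.0)
    post = rng.dirichlet(c + 1.0, size=draws)
    return (post.argmax(axis=1) == c.argmax()).mean() >= threshold

counts, max_samples = {}, 40
for n in range(1, max_samples + 1):
    ans = sample_answer()
    counts[ans] = counts.get(ans, 0) + 1
    if n >= 2 and majority_is_stable(counts):
        break                                  # majority is very unlikely to change

best = max(counts, key=counts.get)
print(f"answer={best} after {n} samples (fixed self-consistency would use {max_samples})")
```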
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.