Automated Robustness Testing for LLM-based NLP Software
- URL: http://arxiv.org/abs/2412.21016v1
- Date: Mon, 30 Dec 2024 15:33:34 GMT
- Title: Automated Robustness Testing for LLM-based NLP Software
- Authors: Mingxuan Xiao, Yan Xiao, Shunhui Ji, Hanbo Cai, Lei Xue, Pengcheng Zhang
- Abstract summary: There are no known automated robustness testing methods specifically designed for LLM-based NLP software.
Existing testing methods designed for DNN-based software can be applied to LLM-based software via AORTA, but their effectiveness is limited.
We propose a novel testing method for LLM-based software within AORTA called Adaptive Beam Search (ABS).
- Score: 6.986328098563149
- License:
- Abstract: Benefiting from the advancements in LLMs, NLP software has undergone rapid development. Such software is widely employed in various safety-critical tasks, such as financial sentiment analysis, toxic content moderation, and log generation. To our knowledge, there are no known automated robustness testing methods specifically designed for LLM-based NLP software. Given the complexity of LLMs and the unpredictability of real-world inputs (including prompts and examples), it is essential to examine the robustness of overall inputs to ensure the safety of such software. To this end, this paper introduces the first AutOmated Robustness Testing frAmework, AORTA, which reconceptualizes the testing process as a combinatorial optimization problem. Existing testing methods designed for DNN-based software can be applied to LLM-based software via AORTA, but their effectiveness is limited. To address this, we propose a novel testing method for LLM-based software within AORTA called Adaptive Beam Search (ABS). ABS is tailored for the expansive feature space of LLMs and improves testing effectiveness through an adaptive beam width and the capability for backtracking. We successfully embedded 18 test methods into the designed framework AORTA and compared the test validity of ABS on three datasets and against five threat models. ABS facilitates a more comprehensive and accurate robustness assessment before software deployment, with an average test success rate of 86.138%. Compared to the currently best-performing baseline PWWS, ABS significantly reduces the computational overhead by up to 3441.895 seconds per successful test case and decreases the number of queries by 218.762 times on average. Furthermore, test cases generated by ABS exhibit greater naturalness and transferability.
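The abstract frames robustness testing as a combinatorial optimization problem over perturbed inputs and introduces Adaptive Beam Search (ABS) with an adaptive beam width and backtracking. The paper's own implementation is not reproduced here; the Python sketch below is only a minimal illustration of that idea, in which `fitness`, `neighbors`, and `is_success` are hypothetical placeholder callbacks (for instance, fitness might score the drop in the victim LLM's confidence, and neighbors might apply word-level perturbations to the prompt or examples).

```python
import heapq
from typing import Callable, Iterable, Tuple


def adaptive_beam_search(
    seed: str,
    neighbors: Callable[[str], Iterable[str]],  # placeholder perturbation operator
    fitness: Callable[[str], float],            # higher = closer to flipping the output
    is_success: Callable[[str], bool],          # True once the prediction changes
    init_width: int = 5,
    max_width: int = 20,
    max_steps: int = 50,
) -> Tuple[str, bool]:
    """Illustrative ABS-style search: the beam widens when progress stalls,
    and the search can backtrack to an earlier beam at dead ends."""
    beam = [(fitness(seed), seed)]
    history = [beam]                 # earlier beams, kept for backtracking
    width, best_score = init_width, beam[0][0]

    for _ in range(max_steps):
        candidates = []
        for _, text in beam:
            for cand in neighbors(text):
                if is_success(cand):
                    return cand, True
                candidates.append((fitness(cand), cand))
        if not candidates:           # dead end: backtrack (useful when neighbors is stochastic)
            if len(history) > 1:
                history.pop()
                beam = history[-1]
                continue
            break
        beam = heapq.nlargest(width, candidates, key=lambda c: c[0])
        history.append(beam)
        if beam[0][0] <= best_score:           # stalled: widen the beam
            width = min(max_width, width * 2)
        else:                                  # improving: narrow back down
            width = max(init_width, width // 2)
            best_score = beam[0][0]

    return beam[0][1], False
```

The width-doubling and backtracking rules here are illustrative placeholders; the paper's actual adaptation and backtracking criteria may differ.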
Related papers
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z) - The Potential of LLMs in Automating Software Testing: From Generation to Reporting [0.0]
Manual testing, while effective, can be time consuming and costly, leading to an increased demand for automated methods.
Recent advancements in Large Language Models (LLMs) have significantly influenced software engineering.
This paper explores an agent-oriented approach to automated software testing, using LLMs to reduce human intervention and enhance testing efficiency.
arXiv Detail & Related papers (2024-12-31T02:06:46Z) - AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z) - The Power of Resets in Online Reinforcement Learning [73.64852266145387]
We explore the power of simulators through online reinforcement learning with local simulator access (or, local planning).
We show that MDPs with low coverability can be learned in a sample-efficient fashion with only $Q^\star$-realizability.
We show that the notorious Exogenous Block MDP problem is tractable under local simulator access.
arXiv Detail & Related papers (2024-04-23T18:09:53Z) - RITFIS: Robust input testing framework for LLMs-based intelligent software [6.439196068684973]
RITFIS is the first framework designed to assess the robustness of LLM-based intelligent software against natural language inputs.
RITFIS adapts 17 automated testing methods, originally designed for Deep Neural Network (DNN)-based intelligent software.
Empirical validation demonstrates the effectiveness of RITFIS in evaluating LLM-based intelligent software.
arXiv Detail & Related papers (2024-02-21T04:00:54Z) - Large Language Models Based Fuzzing Techniques: A Survey [4.155653485098873]
Fuzzing, as an efficient software testing method, is widely used in various domains.
The rapid development of Large Language Models (LLMs) has facilitated their application in the field of software testing.
There is a growing trend towards employing fuzzing tests generated by large language models.
arXiv Detail & Related papers (2024-02-01T05:34:03Z) - Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation [12.503002900186997]
Large Language Models (LLMs) have gained popularity for code generation, including the automated creation of test cases.
LLMs are often trained on vast amounts of publicly available code, which may include test cases that do not adhere to best practices.
We propose a novel technique called Reinforcement Learning from Static Quality Metrics (RLSQM).
arXiv Detail & Related papers (2023-10-03T18:48:31Z) - Identifying the Risks of LM Agents with an LM-Emulated Sandbox [68.26587052548287]
Language Model (LM) agents and tools enable a rich set of capabilities but also amplify potential risks.
The high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks.
We introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios.
arXiv Detail & Related papers (2023-09-25T17:08:02Z) - LEAP: Efficient and Automated Test Method for NLP Software [6.439196068684973]
This paper proposes LEAP, an automated test method that uses LEvy flight-based Adaptive Particle swarm optimization integrated with textual features to generate adversarial test cases (a generic illustrative sketch of the Lévy-flight PSO update follows the related papers).
We conducted a series of experiments to validate LEAP's ability to test NLP software and found that the average success rate of LEAP in generating adversarial test cases is 79.1%.
While ensuring high success rates, LEAP significantly reduces time overhead by up to 147.6s compared to other inertial-based methods.
arXiv Detail & Related papers (2023-08-22T08:51:10Z) - Using Machine Learning To Identify Software Weaknesses From Software Requirement Specifications [49.1574468325115]
This research focuses on finding an efficient machine learning algorithm to identify software weaknesses from requirement specifications.
Keywords extracted using latent semantic analysis help map the CWE categories to PROMISE_exp. Naive Bayes, support vector machine (SVM), decision trees, neural network, and convolutional neural network (CNN) algorithms were tested.
arXiv Detail & Related papers (2023-08-10T13:19:10Z) - Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size (a rough illustrative sketch of a log-barrier step follows the related papers).
We demonstrate the effectiveness of our approach on minimizing constraint violations in policy optimization tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z)
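For the LEAP entry above, the sketch referenced there: a generic Lévy-flight particle swarm update. It omits LEAP's integration with textual features and simply maximizes an arbitrary continuous `fitness` function standing in for the adversarial objective; the Lévy step uses Mantegna's algorithm, and the jump scale of 0.01 is an arbitrary choice, not a value from the paper.

```python
import numpy as np
from math import gamma, sin, pi


def levy_step(dim: int, beta: float = 1.5) -> np.ndarray:
    """Heavy-tailed Lévy-distributed step via Mantegna's algorithm."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = np.random.normal(0.0, sigma, dim)
    v = np.random.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)


def levy_pso(fitness, dim=10, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Standard PSO velocity update plus an occasional Lévy jump (illustrative only)."""
    pos = np.random.uniform(-1.0, 1.0, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_val.argmax()].copy()
    gbest_val = pbest_val.max()

    for _ in range(iters):
        r1 = np.random.rand(n_particles, dim)
        r2 = np.random.rand(n_particles, dim)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        jumps = np.array([levy_step(dim) for _ in range(n_particles)])
        pos = pos + vel + 0.01 * jumps       # Lévy flight helps escape local optima
        vals = np.array([fitness(p) for p in pos])
        improved = vals > pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        if pbest_val.max() > gbest_val:
            gbest_val = pbest_val.max()
            gbest = pbest[pbest_val.argmax()].copy()
    return gbest, gbest_val
```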
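For the LBSGD entry, the summary only states that a logarithmic barrier approximation is applied with a carefully chosen step size. The update below is a rough, generic log-barrier gradient step, not the paper's exact rule: `f_grad` and `g_grad` are gradients of the objective and of a safety constraint that is satisfied when g(x) < 0, `t` is the barrier weight, and the step size is halved until the trial point stays strictly feasible.

```python
import numpy as np


def log_barrier_step(x, f_grad, g, g_grad, t=10.0, eta=0.1):
    """One gradient step on the barrier-augmented objective
        B(x) = f(x) - (1/t) * log(-g(x)),
    a rough sketch of the log-barrier idea (not the paper's exact LBSGD rule).
    Assumes the current iterate is strictly feasible, i.e. g(x) < 0."""
    gx = g(x)
    assert gx < 0, "current iterate must be strictly feasible"
    # gradient of -(1/t) * log(-g(x)) is g_grad(x) / (t * (-g(x)))
    direction = f_grad(x) + g_grad(x) / (t * (-gx))
    for _ in range(30):                 # crude feasibility safeguard on the step size
        x_new = x - eta * direction
        if g(x_new) < 0:
            return x_new
        eta *= 0.5
    return x                            # no feasible step found; stay put


# Hypothetical usage: minimize f(x) = ||x||^2 subject to g(x) = 1 - x[0] < 0 (i.e. x[0] > 1)
x = np.array([2.0, 0.5])
x = log_barrier_step(x,
                     f_grad=lambda z: 2 * z,
                     g=lambda z: 1.0 - z[0],
                     g_grad=lambda z: np.array([-1.0, 0.0]))
```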