Assessing the Robustness of LLM-based NLP Software via Automated Testing
- URL: http://arxiv.org/abs/2412.21016v2
- Date: Mon, 17 Mar 2025 13:42:06 GMT
- Title: Assessing the Robustness of LLM-based NLP Software via Automated Testing
- Authors: Mingxuan Xiao, Yan Xiao, Shunhui Ji, Hanbo Cai, Lei Xue, Pengcheng Zhang
- Abstract summary: This paper introduces AutOmated Robustness Testing frAmework, AORTA, which reconceptualizes the testing process into a combinatorial optimization problem. We propose a novel testing method for LLM-based software within AORTA called Adaptive Beam Search (ABS). ABS is tailored for the expansive feature space of LLMs and improves testing effectiveness through an adaptive beam width and the capability for backtracking.
- Score: 6.986328098563149
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Benefiting from the advancements in LLMs, NLP software has undergone rapid development. Such software is widely employed in various safety-critical tasks, such as financial sentiment analysis, toxic content moderation, and log generation. Unlike traditional software, LLM-based NLP software relies on prompts and examples as inputs. Given the complexity of LLMs and the unpredictability of real-world inputs, quantitatively assessing the robustness of such software is crucial. However, to the best of our knowledge, no automated robustness testing methods have been specifically designed to evaluate the overall inputs of LLM-based NLP software. To this end, this paper introduces the first AutOmated Robustness Testing frAmework, AORTA, which reconceptualizes the testing process into a combinatorial optimization problem. Existing testing methods designed for DNN-based software can be applied to LLM-based software by AORTA, but their effectiveness is limited. To address this, we propose a novel testing method for LLM-based software within AORTA called Adaptive Beam Search. ABS is tailored for the expansive feature space of LLMs and improves testing effectiveness through an adaptive beam width and the capability for backtracking. We successfully embed 18 test methods in the designed framework AORTA and compare the test validity of ABS with three datasets and five threat models. ABS facilitates a more comprehensive and accurate robustness assessment before software deployment, with an average test success rate of 86.138%. Compared to the currently best-performing baseline PWWS, ABS significantly reduces the computational overhead by up to 3441.895 seconds per successful test case and decreases the number of queries by 218.762 times on average. Furthermore, test cases generated by ABS exhibit greater naturalness and transferability.
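The abstract describes ABS only at a high level: a beam search whose width adapts and which can backtrack. The following is a minimal sketch of that general idea, not the authors' implementation. The function name `adaptive_beam_search`, the `substitutes` table, the `score` callback (a stand-in for the confidence the software under test assigns to the original label), and the specific widening, narrowing, and backtracking rules are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

State = Tuple[str, ...]


def adaptive_beam_search(
    words: Sequence[str],
    substitutes: Dict[str, List[str]],
    score: Callable[[State], float],
    threshold: float = 0.5,
    min_width: int = 1,
    max_width: int = 4,
    max_steps: int = 10,
) -> Optional[State]:
    """Search for a perturbed input whose score falls below `threshold`.

    `score` is hypothetical: lower values mean the software under test is
    closer to changing its output. Returns None if no such input is found.
    """
    start: State = tuple(words)
    beam: List[State] = [start]
    seen = {start}
    reserve: List[Tuple[float, State]] = []  # pruned states kept for backtracking
    best = score(start)
    width = min_width

    for _ in range(max_steps):
        # Expand: substitute one word in each state of the current beam.
        candidates: List[Tuple[float, State]] = []
        for state in beam:
            for i, word in enumerate(state):
                for sub in substitutes.get(word, []):
                    new = state[:i] + (sub,) + state[i + 1:]
                    if new not in seen:
                        seen.add(new)
                        candidates.append((score(new), new))
        candidates.sort()
        if candidates and candidates[0][0] < threshold:
            return candidates[0][1]  # successful test case

        if candidates and candidates[0][0] < best:
            # Progress: record it and narrow the beam to save queries.
            best = candidates[0][0]
            width = max(min_width, width - 1)
        else:
            # Stall: widen the beam and backtrack to previously pruned states.
            width = min(max_width, width + 1)
            candidates = sorted(candidates + reserve)
            reserve = []
        if not candidates:
            return None
        beam = [state for _, state in candidates[:width]]
        reserve.extend(candidates[width:])
    return None
```

In this sketch each call to `score` corresponds to one query to the software under test, so narrowing the beam while the search is improving is what keeps the query count down, and the reserve list is what lets the search recover from a pruning decision that turned out to be a dead end.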
Related papers
- ABFS: Natural Robustness Testing for LLM-based NLP Software [8.833542944724465]
The use of Large Language Models (LLMs) in Natural Language Processing (NLP) software has rapidly gained traction across various domains.
These applications frequently exhibit robustness deficiencies, where slight perturbations in input may lead to erroneous outputs.
Current robustness testing methods face two main limitations: (1) low testing effectiveness, and (2) insufficient naturalness of test cases.
arXiv Detail & Related papers (2025-03-03T09:02:06Z)
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
- The Potential of LLMs in Automating Software Testing: From Generation to Reporting [0.0]
Manual testing, while effective, can be time consuming and costly, leading to an increased demand for automated methods.
Recent advancements in Large Language Models (LLMs) have significantly influenced software engineering.
This paper explores an agent-oriented approach to automated software testing, using LLMs to reduce human intervention and enhance testing efficiency.
arXiv Detail & Related papers (2024-12-31T02:06:46Z)
- Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation [12.503002900186997]
Large Language Models (LLMs) have gained popularity for automated test case generation. Because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices. We propose Reinforcement Learning from Static Quality Metrics (RLSQM) to generate high-quality unit tests based on static analysis-based quality metrics.
arXiv Detail & Related papers (2024-12-18T20:20:01Z)
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
- AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation.
We show AIME outperforming baseline methods in code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets.
arXiv Detail & Related papers (2024-10-04T04:03:24Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification (UQ) is a critical component of machine learning (ML) applications.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines.
We conduct a large-scale empirical investigation of UQ and normalization techniques across nine tasks, and identify the most promising approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
- The Power of Resets in Online Reinforcement Learning [73.64852266145387]
We explore the power of simulators through online reinforcement learning with local simulator access (or, local planning).
We show that MDPs with low coverability can be learned in a sample-efficient fashion with only $Q^\star$-realizability.
We show that the notorious Exogenous Block MDP problem is tractable under local simulator access.
arXiv Detail & Related papers (2024-04-23T18:09:53Z)
- RITFIS: Robust input testing framework for LLMs-based intelligent software [6.439196068684973]
RITFIS is the first framework designed to assess the robustness of intelligent software against natural language inputs.
RITFIS adapts 17 automated testing methods, originally designed for Deep Neural Network (DNN)-based intelligent software.
The paper demonstrates the effectiveness of RITFIS in evaluating LLM-based intelligent software through empirical validation.
arXiv Detail & Related papers (2024-02-21T04:00:54Z)
- Large Language Models Based Fuzzing Techniques: A Survey [4.155653485098873]
Fuzzing tests, as an efficient software testing method, are widely used in various domains.
The rapid development of Large Language Models (LLMs) has facilitated their application in the field of software testing.
There is a growing trend towards employing fuzzing tests generated with large language models.
arXiv Detail & Related papers (2024-02-01T05:34:03Z)
- LEAP: Efficient and Automated Test Method for NLP Software [6.439196068684973]
This paper proposes LEAP, an automated test method that uses LEvy flight-based Adaptive Particle swarm optimization integrated with textual features to generate adversarial test cases.
We conducted a series of experiments to validate LEAP's ability to test NLP software and found that the average success rate of LEAP in generating adversarial test cases is 79.1%.
While ensuring high success rates, LEAP significantly reduces time overhead by up to 147.6s compared to other inertial-based methods.
arXiv Detail & Related papers (2023-08-22T08:51:10Z)
- Using Machine Learning To Identify Software Weaknesses From Software Requirement Specifications [49.1574468325115]
This research focuses on finding an efficient machine learning algorithm to identify software weaknesses from requirement specifications.
Keywords extracted using latent semantic analysis help map the CWE categories to PROMISE_exp. Naive Bayes, support vector machine (SVM), decision trees, neural network, and convolutional neural network (CNN) algorithms were tested.
arXiv Detail & Related papers (2023-08-10T13:19:10Z)
- Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for seeking a stationary point in high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing violation in policy tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z)