Automated Robustness Testing for LLM-based NLP Software
 - URL: http://arxiv.org/abs/2412.21016v1
 - Date: Mon, 30 Dec 2024 15:33:34 GMT
 - Title: Automated Robustness Testing for LLM-based NLP Software
 - Authors: Mingxuan Xiao, Yan Xiao, Shunhui Ji, Hanbo Cai, Lei Xue, Pengcheng Zhang
 - Abstract summary: There are no known automated robustness testing methods specifically designed for LLM-based NLP software. Existing testing methods can be applied to LLM-based software by AORTA, but their effectiveness is limited. We propose a novel testing method for LLM-based software within AORTA called Adaptive Beam Search.
 - Score: 6.986328098563149
 - License: http://creativecommons.org/licenses/by-nc-nd/4.0/
 - Abstract: Benefiting from advances in LLMs, NLP software has undergone rapid development. Such software is widely employed in various safety-critical tasks, such as financial sentiment analysis, toxic content moderation, and log generation. To our knowledge, there are no automated robustness testing methods specifically designed for LLM-based NLP software. Given the complexity of LLMs and the unpredictability of real-world inputs (including prompts and examples), it is essential to examine the robustness of the overall input to ensure the safety of such software. To this end, this paper introduces the first AutOmated Robustness Testing frAmework, AORTA, which reconceptualizes the testing process as a combinatorial optimization problem. Existing testing methods designed for DNN-based software can be applied to LLM-based software through AORTA, but their effectiveness is limited. To address this, we propose a novel testing method within AORTA called Adaptive Beam Search (ABS). ABS is tailored to the expansive feature space of LLMs and improves testing effectiveness through an adaptive beam width and the capability to backtrack. We embedded 18 testing methods in the AORTA framework and compared the test validity of ABS with them across three datasets and five threat models. ABS facilitates a more comprehensive and accurate robustness assessment before software deployment, with an average test success rate of 86.138%. Compared to the currently best-performing baseline, PWWS, ABS significantly reduces the computational overhead by up to 3441.895 seconds per successful test case and decreases the number of queries by 218.762 times on average. Furthermore, test cases generated by ABS exhibit greater naturalness and transferability.
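The abstract frames robustness testing as a combinatorial search over perturbed inputs (prompts and examples) and attributes ABS's effectiveness to an adaptive beam width plus the ability to backtrack. The sketch below only illustrates that idea under stated assumptions; `score_fn`, `neighbors_fn`, `success_fn`, the beam-width schedule, and the query budget are hypothetical stand-ins for AORTA's actual components, not the published algorithm.

```python
# A minimal, hedged sketch of an adaptive beam search for robustness testing,
# in the spirit of ABS as described in the abstract. The interfaces below
# (score_fn, neighbors_fn, success_fn, width schedule, query budget) are
# illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    text: str      # perturbed input (prompt and/or example)
    score: float   # e.g. drop in the victim model's confidence on the original output
    depth: int     # number of perturbations applied so far


def adaptive_beam_search(
    seed_text: str,
    score_fn: Callable[[str], float],          # higher = closer to flipping the output
    neighbors_fn: Callable[[str], List[str]],  # single-edit perturbations of a text
    success_fn: Callable[[str], bool],         # True once the output actually changes
    init_width: int = 3,
    max_width: int = 10,
    max_depth: int = 20,
    max_queries: int = 2000,
) -> Tuple[str, bool]:
    """Search for a successful test case; return (best_text, success_flag)."""
    width = init_width
    frontier = [Candidate(seed_text, score_fn(seed_text), 0)]
    history: List[List[Candidate]] = []   # earlier frontiers, kept for backtracking
    best = frontier[0]
    queries = 1

    while frontier and frontier[0].depth < max_depth and queries < max_queries:
        expanded: List[Candidate] = []
        for cand in frontier:
            for text in neighbors_fn(cand.text):
                if queries >= max_queries:
                    break
                queries += 1
                if success_fn(text):
                    return text, True       # successful test case found
                expanded.append(Candidate(text, score_fn(text), cand.depth + 1))

        expanded.sort(key=lambda c: c.score, reverse=True)
        if expanded and expanded[0].score > best.score:
            best = expanded[0]
            history.append(frontier)        # remember where the progress came from
            width = init_width              # progress: narrow the beam again
            frontier = expanded[:width]
        elif width < max_width:
            width += 1                      # stalled: widen the beam adaptively
            frontier = expanded[:width]
        elif history:
            frontier = history.pop()        # dead end: backtrack to an earlier frontier
            width = init_width
        else:
            break

    return best.text, False
```

In practice, `neighbors_fn` would produce naturalness-preserving perturbations (e.g. synonym substitutions) and `score_fn` would query the LLM under test, which is why an explicit query budget is included in this sketch.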
 
       
      
        Related papers
        - Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders [50.52694757593443]
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations. We first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability. We introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity.
arXiv  Detail & Related papers  (2025-06-16T20:58:05Z)
- Boosting Rust Unit Test Coverage through Hybrid Program Analysis and Large Language Models [14.536415473544146]
This paper presents PALM, an approach that leverages large language models (LLMs) to enhance the generation of high-coverage unit tests. PALM performs program analysis to identify branching conditions within functions, which are then combined into path constraints. We implement the approach and evaluate it on 10 open-source Rust crates.
arXiv  Detail & Related papers  (2025-06-10T17:21:21Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv  Detail & Related papers  (2025-05-28T17:57:47Z)
- ABFS: Natural Robustness Testing for LLM-based NLP Software [8.833542944724465]
The use of Large Language Models (LLMs) in Natural Language Processing (NLP) software has rapidly gained traction across various domains.
These applications frequently exhibit robustness deficiencies, where slight perturbations in input may lead to erroneous outputs.
Current robustness testing methods face two main limitations: (1) low testing effectiveness, and (2) insufficient naturalness of test cases.
arXiv  Detail & Related papers  (2025-03-03T09:02:06Z)
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv  Detail & Related papers  (2025-01-24T06:39:38Z)
- The Potential of LLMs in Automating Software Testing: From Generation to Reporting [0.0]
Manual testing, while effective, can be time-consuming and costly, leading to an increased demand for automated methods.
Recent advancements in Large Language Models (LLMs) have significantly influenced software engineering.
This paper explores an agent-oriented approach to automated software testing, using LLMs to reduce human intervention and enhance testing efficiency.
arXiv  Detail & Related papers  (2024-12-31T02:06:46Z)
- Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation [12.503002900186997]
Large Language Models (LLMs) have gained popularity for automated test case generation. Because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices. We propose Reinforcement Learning from Static Quality Metrics (RLSQM) to generate high-quality unit tests based on static analysis-based quality metrics.
arXiv  Detail & Related papers  (2024-12-18T20:20:01Z)
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv  Detail & Related papers  (2024-11-02T13:24:30Z)
- AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol that utilizes multiple LLMs, each of which independently generates an evaluation on a separate criterion; the evaluations are then combined via concatenation (a minimal sketch of this scheme appears after the list below).
We show AIME outperforming baseline methods in code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single LLM evaluation protocol on the LeetCodeHard and HumanEval datasets.
arXiv  Detail & Related papers  (2024-10-04T04:03:24Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification (UQ) is a critical component of machine learning (ML) applications.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines.
We conduct a large-scale empirical investigation of UQ and normalization techniques across nine tasks, and identify the most promising approaches.
arXiv  Detail & Related papers  (2024-06-21T20:06:31Z)
- The Power of Resets in Online Reinforcement Learning [73.64852266145387]
We explore the power of simulators through online reinforcement learning with local simulator access (or local planning).
We show that MDPs with low coverability can be learned in a sample-efficient fashion with only $Q^\star$-realizability.
We show that the notorious Exogenous Block MDP problem is tractable under local simulator access.
arXiv  Detail & Related papers  (2024-04-23T18:09:53Z)
- RITFIS: Robust input testing framework for LLMs-based intelligent software [6.439196068684973]
RITFIS is the first framework designed to assess the robustness of intelligent software against natural language inputs.
RITFIS adapts 17 automated testing methods, originally designed for Deep Neural Network (DNN)-based intelligent software.
It demonstrates the effectiveness of RITFIS in evaluating LLM-based intelligent software through empirical validation.
arXiv  Detail & Related papers  (2024-02-21T04:00:54Z)
- Large Language Models Based Fuzzing Techniques: A Survey [4.155653485098873]
Fuzzing, as an efficient software testing method, is widely used in various domains.
The rapid development of Large Language Models (LLMs) has facilitated their application in the field of software testing.
There is a growing trend towards employing fuzzing tests generated by large language models.
arXiv  Detail & Related papers  (2023-08-22T08:51:10Z)
- LEAP: Efficient and Automated Test Method for NLP Software [6.439196068684973]
This paper proposes LEAP, an automated test method that uses LEvy flight-based Adaptive Particle swarm optimization integrated with textual features to generate adversarial test cases.
We conducted a series of experiments to validate LEAP's ability to test NLP software and found that the average success rate of LEAP in generating adversarial test cases is 79.1%.
While ensuring high success rates, LEAP significantly reduces time overhead by up to 147.6s compared to other inertial-based methods.
arXiv  Detail & Related papers  (2023-08-10T13:19:10Z)
- Using Machine Learning To Identify Software Weaknesses From Software Requirement Specifications [49.1574468325115]
This research focuses on finding an efficient machine learning algorithm to identify software weaknesses from requirement specifications.
 Keywords extracted using latent semantic analysis help map the CWE categories to PROMISE_exp. Naive Bayes, support vector machine (SVM), decision trees, neural network, and convolutional neural network (CNN) algorithms were tested.
arXiv  Detail & Related papers  (2023-08-10T13:19:10Z)
- Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach called LBSGD is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing violations in policy tasks in safe reinforcement learning.
arXiv  Detail & Related papers  (2022-07-21T11:14:47Z) 
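The AIME entry above describes combining independent per-criterion LLM evaluations by concatenation. The following minimal sketch illustrates only that combination step; `evaluate_with_llm` and the example criteria are hypothetical placeholders, not the AIME protocol's actual prompts or API.

```python
# Hedged illustration of the multi-evaluator scheme summarized in the AIME entry:
# several LLM evaluators each judge one criterion independently, and their
# verdicts are combined by simple concatenation. `evaluate_with_llm` is a
# hypothetical callable supplied by the caller, not a real AIME API.
from typing import Callable, List


def concatenated_evaluation(
    artifact: str,                                  # e.g. a generated code snippet
    criteria: List[str],                            # e.g. ["correctness", "efficiency"]
    evaluate_with_llm: Callable[[str, str], str],   # (artifact, criterion) -> verdict text
) -> str:
    verdicts = [f"[{c}] {evaluate_with_llm(artifact, c)}" for c in criteria]
    return "\n".join(verdicts)  # combined via concatenation, as the summary states
```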