ABFS: Natural Robustness Testing for LLM-based NLP Software
- URL: http://arxiv.org/abs/2503.01319v1
- Date: Mon, 03 Mar 2025 09:02:06 GMT
- Title: ABFS: Natural Robustness Testing for LLM-based NLP Software
- Authors: Mingxuan Xiao, Yan Xiao, Shunhui Ji, Yunhe Li, Lei Xue, Pengcheng Zhang
- Abstract summary: Large Language Model (LLM)-based Natural Language Processing (NLP) software has rapidly gained traction across various domains. These applications frequently exhibit robustness deficiencies, where slight perturbations in input may lead to erroneous outputs. Current robustness testing methods face two main limitations: (1) low testing effectiveness, and (2) insufficient naturalness of test cases.
- Score: 8.833542944724465
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Owing to the exceptional performance of Large Language Models (LLMs) in Natural Language Processing (NLP) tasks, LLM-based NLP software has rapidly gained traction across various domains, such as financial analysis and content moderation. However, these applications frequently exhibit robustness deficiencies, where slight perturbations in input (prompt+example) may lead to erroneous outputs. Current robustness testing methods face two main limitations: (1) low testing effectiveness, limiting the applicability of LLM-based software in safety-critical scenarios, and (2) insufficient naturalness of test cases, reducing the practical value of testing outcomes. To address these issues, this paper proposes ABFS, a straightforward yet effective automated testing method that, for the first time, treats the input prompts and examples as a unified whole for robustness testing. Specifically, ABFS formulates the testing process as a combinatorial optimization problem, employing Best-First Search to identify successful test cases within the perturbation space and designing a novel Adaptive control strategy to enhance test case naturalness. We evaluate the robustness testing performance of ABFS on three datasets across five threat models. On Llama2-13b, the traditional StressTest achieves only a 13.273% success rate, while ABFS attains a success rate of 98.064%, supporting a more comprehensive robustness assessment before software deployment. Compared to baseline methods, ABFS introduces fewer modifications to the original input and consistently generates test cases with superior naturalness. Furthermore, test cases generated by ABFS exhibit stronger transferability and higher testing efficiency, significantly reducing testing costs.
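To make the search strategy concrete, here is a minimal sketch of a Best-First Search loop over the perturbation space, in the spirit of ABFS. The `predict_proba`, `perturb`, and `naturalness_penalty` callables are assumed stand-ins for a victim-model wrapper, a word-level perturbation generator, and the paper's adaptive naturalness control; this illustrates the search scheme, not the authors' implementation.

```python
import heapq
from typing import Callable, Iterable

def best_first_attack(
    text: str,
    true_label: int,
    predict_proba: Callable[[str], list],              # assumed: returns class probabilities
    perturb: Callable[[str], Iterable],                # assumed: yields perturbed variants
    naturalness_penalty: Callable[[str, str], float],  # stand-in for the adaptive control
    max_expansions: int = 200,
):
    """Best-first search over the perturbation space: always expand the
    candidate with the lowest (confidence + unnaturalness) score, and stop
    as soon as a candidate flips the model's prediction."""
    frontier = [(predict_proba(text)[true_label], 0, text)]  # (score, tiebreak, text)
    seen, counter, expansions = {text}, 1, 0
    while frontier and expansions < max_expansions:
        expansions += 1
        _, _, current = heapq.heappop(frontier)
        for cand in perturb(current):
            if cand in seen:
                continue
            seen.add(cand)
            probs = predict_proba(cand)
            if max(range(len(probs)), key=probs.__getitem__) != true_label:
                return cand  # successful test case: the prediction flipped
            score = probs[true_label] + naturalness_penalty(text, cand)
            heapq.heappush(frontier, (score, counter, cand))
            counter += 1
    return None  # budget exhausted without flipping the prediction
```

Because expansion always follows the candidate with the lowest true-label confidence (plus a naturalness penalty), the search tends to flip predictions with few edits, which matches the paper's emphasis on minimally modified, natural test cases.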
Related papers
- Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [60.881609323604685]
Large Language Models (LLMs) accessed via black-box APIs introduce a trust challenge.
Users pay for services based on advertised model capabilities, but providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs.
This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking.
arXiv Detail & Related papers (2025-04-07T03:57:41Z)
- Boundary Value Test Input Generation Using Prompt Engineering with LLMs: Fault Detection and Coverage Analysis [3.249891166806818]
This paper presents a framework for assessing the effectiveness of large language models (LLMs) in generating boundary value test inputs for white-box software testing. Our analysis shows the strengths and limitations of LLMs in boundary value generation, particularly in detecting common boundary-related issues. This research provides insights into the role of LLMs in boundary value testing, underscoring both their potential and areas for improvement in automated testing methods.
arXiv Detail & Related papers (2025-01-24T12:54:19Z)
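As a quick refresher on the classic technique the entry above builds on (generic boundary value analysis, not the paper's LLM-driven framework):

```python
def boundary_values(lo: int, hi: int) -> list:
    """Classic boundary value analysis for an integer range [lo, hi]:
    test each boundary, its immediate in-range neighbor, and the first
    invalid value on either side."""
    return [lo - 1, lo, lo + 1, hi - 1, hi, hi + 1]

# e.g. a field validated as 1..100 yields [0, 1, 2, 99, 100, 101]
```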
- Automated Robustness Testing for LLM-based NLP Software [6.986328098563149]
There are no known automated robustness testing methods specifically designed for LLM-based NLP software. Existing testing methods can be applied to LLM-based software via AORTA, but their effectiveness is limited. We propose a novel testing method for LLM-based software within AORTA called Adaptive Beam Search.
arXiv Detail & Related papers (2024-12-30T15:33:34Z)
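For contrast with the best-first sketch above, a beam-search variant like the one this entry names keeps only a fixed number of the lowest-confidence candidates per round. As before, `predict_proba` and `perturb` are assumed stand-ins, not the paper's API:

```python
def beam_search_attack(text, true_label, predict_proba, perturb,
                       beam_width: int = 5, max_rounds: int = 20):
    """Beam search over the perturbation space: each round expands every
    beam member and keeps the beam_width candidates with the lowest
    confidence in the true label. Returns a flipping test case or None."""
    beam, seen = [text], {text}
    for _ in range(max_rounds):
        scored = []
        for parent in beam:
            for cand in perturb(parent):
                if cand in seen:
                    continue
                seen.add(cand)
                probs = predict_proba(cand)
                if max(range(len(probs)), key=probs.__getitem__) != true_label:
                    return cand  # prediction flipped
                scored.append((probs[true_label], cand))
        if not scored:
            return None  # perturbation space exhausted
        scored.sort(key=lambda pair: pair[0])
        beam = [cand for _, cand in scored[:beam_width]]
    return None
```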
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
- Fuzzy Inference System for Test Case Prioritization in Software Testing [0.0]
Test case prioritization (TCP) is a vital strategy to enhance testing efficiency.
This paper introduces a novel fuzzy logic-based approach to automate TCP.
arXiv Detail & Related papers (2024-04-25T08:08:54Z)
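The summary does not spell out the rule base, so the following is only a toy illustration of fuzzy inference for prioritization, with two assumed inputs (fault-proneness and execution cost, both scaled to [0, 1]) and a hand-written two-rule Mamdani-style score:

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function peaking at b over (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def priority_score(fault_proneness: float, cost: float) -> float:
    """Fire two fuzzy rules and defuzzify by a weighted average."""
    high_fault = triangular(fault_proneness, 0.4, 1.0, 1.6)  # 'high' peaks at 1
    low_cost = triangular(cost, -0.6, 0.0, 0.6)              # 'low' peaks at 0
    w1 = min(high_fault, low_cost)  # rule 1: high fault AND low cost -> 1.0
    w2 = 1.0 - w1                   # rule 2: otherwise -> 0.3
    return w1 * 1.0 + w2 * 0.3      # weights sum to 1, so no division needed

# Prioritize by descending score, e.g.:
# tests.sort(key=lambda t: priority_score(t.fault, t.cost), reverse=True)
```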
- Active Test-Time Adaptation: Theoretical Analyses and An Algorithm [51.84691955495693]
Test-time adaptation (TTA) addresses distribution shifts for streaming test data in unsupervised settings.
We propose the novel problem setting of active test-time adaptation (ATTA) that integrates active learning within the fully TTA setting.
arXiv Detail & Related papers (2024-04-07T22:31:34Z)
- An empirical study of testing machine learning in the wild [35.13282520395855]
Machine and deep learning (ML/DL) algorithms have been increasingly adopted in many software systems.
Because these algorithms are inductive in nature, ensuring the quality of the resulting systems remains a significant challenge for the research community.
Recent research in ML/DL quality assurance has adapted concepts from traditional software testing, such as mutation testing, to improve reliability.
arXiv Detail & Related papers (2023-12-19T21:18:14Z)
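For readers unfamiliar with the borrowed concept, a generic mutation-testing illustration (not drawn from the study itself): inject a small artificial defect, then check whether the test suite notices.

```python
def is_adult(age: int) -> bool:
    return age >= 18

def is_adult_mutant(age: int) -> bool:
    return age > 18  # mutant: '>=' replaced by '>'

# A boundary test "kills" the mutant: it passes on the original
# function but fails on the mutated one, so the suite would catch
# this injected defect.
assert is_adult(18) is True
assert is_adult_mutant(18) is False
```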
- Precise Error Rates for Computationally Efficient Testing [75.63895690909241]
We revisit the question of simple-versus-simple hypothesis testing with an eye towards computational complexity.
An existing test based on linear spectral statistics achieves the best possible tradeoff curve between type I and type II error rates.
arXiv Detail & Related papers (2023-11-01T04:41:16Z)
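For reference, the textbook objects behind that claim (standard definitions, not specific to this paper): a test psi outputs 1 to reject the null hypothesis H0 in favor of H1, and its two error rates are

```latex
\alpha(\psi) = \Pr_{H_0}\left[\psi = 1\right] \quad \text{(type I error)}, \qquad
\beta(\psi) = \Pr_{H_1}\left[\psi = 0\right] \quad \text{(type II error)}.
```

The tradeoff curve is the frontier of achievable (alpha, beta) pairs; the entry's claim is that a test based on linear spectral statistics attains the best frontier achievable by computationally efficient tests.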
- LEAP: Efficient and Automated Test Method for NLP Software [6.439196068684973]
This paper proposes LEAP, an automated test method that uses LEvy flight-based Adaptive Particle swarm optimization integrated with textual features to generate adversarial test cases.
We conducted a series of experiments to validate LEAP's ability to test NLP software and found that the average success rate of LEAP in generating adversarial test cases is 79.1%.
While ensuring high success rates, LEAP reduces time overhead by up to 147.6s compared to other inertia-based methods.
arXiv Detail & Related papers (2023-08-22T08:51:10Z)
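The summary omits LEAP's update equations, but the Levy-flight ingredient in its name has a standard construction (Mantegna's algorithm), sketched here as a generic illustration rather than the paper's exact operator:

```python
import math
import random

def levy_step(beta: float = 1.5) -> float:
    """One-dimensional Levy-flight step via Mantegna's algorithm:
    heavy-tailed steps mix many small moves with occasional long
    jumps, helping swarm-based searches escape local optima."""
    sigma_u = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    u = random.gauss(0.0, sigma_u)
    v = random.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)

# A particle update might then perturb a position by step_size * levy_step().
```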
- On Pitfalls of Test-Time Adaptation [82.8392232222119]
Test-Time Adaptation (TTA) has emerged as a promising approach for tackling the robustness challenge under distribution shifts.
We present TTAB, a test-time adaptation benchmark that encompasses ten state-of-the-art algorithms, a diverse array of distribution shifts, and two evaluation protocols.
arXiv Detail & Related papers (2023-06-06T09:35:29Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
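For intuition on why pooling wins at low prevalence (classic two-stage Dorfman arithmetic, not this paper's Bayesian sequential design): with prevalence p and pools of size k, a pool tests negative with probability (1 - p)^k, so the expected number of tests per person is 1/k + 1 - (1 - p)^k.

```python
def dorfman_tests_per_person(p: float, k: int) -> float:
    """Expected tests per person under two-stage Dorfman pooling:
    one pooled test per group of k, plus k individual retests
    whenever the pool is positive (probability 1 - (1-p)^k)."""
    return 1 / k + 1 - (1 - p) ** k

# At 1% prevalence with pools of 10: ~0.196 tests per person,
# roughly a 5x saving over testing everyone individually.
print(dorfman_tests_per_person(0.01, 10))  # ~0.1956
```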