BASFuzz: Towards Robustness Evaluation of LLM-based NLP Software via Automated Fuzz Testing
- URL: http://arxiv.org/abs/2509.17335v1
- Date: Mon, 22 Sep 2025 03:13:57 GMT
- Title: BASFuzz: Towards Robustness Evaluation of LLM-based NLP Software via Automated Fuzz Testing
- Authors: Mingxuan Xiao, Yan Xiao, Shunhui Ji, Jiahe Tu, Pengcheng Zhang,
- Abstract summary: BASFuzz is an efficient fuzz testing method tailored for large language model (LLM)-based NLP software. A Beam-Annealing Search algorithm, which integrates beam search and simulated annealing, is employed to design an efficient fuzzing loop. Experiments demonstrate that BASFuzz achieves a testing effectiveness of 90.335% while reducing the average time overhead by 2,163.852 seconds.
- Score: 8.893978269498524
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Fuzzing has shown great success in evaluating the robustness of intelligent natural language processing (NLP) software. As large language model (LLM)-based NLP software is widely deployed in critical industries, existing methods still face two main challenges: (1) testing methods are insufficiently coupled with the behavioral patterns of LLM-based NLP software; and (2) fuzzing capability generally degrades in natural language generation (NLG) testing scenarios. To address these issues, we propose BASFuzz, an efficient fuzz testing method tailored for LLM-based NLP software. BASFuzz targets complete test inputs composed of prompts and examples, and uses a text consistency metric to guide mutations in the fuzzing loop, aligning with the behavioral patterns of LLM-based NLP software. A Beam-Annealing Search algorithm, which integrates beam search and simulated annealing, is employed to design an efficient fuzzing loop. In addition, information entropy-based adaptive adjustment and an elitism strategy further enhance fuzzing capability. We evaluate BASFuzz on six datasets in representative scenarios of NLG and natural language understanding (NLU). Experimental results demonstrate that BASFuzz achieves a testing effectiveness of 90.335% while reducing the average time overhead by 2,163.852 seconds compared to the current best baseline, enabling more effective robustness evaluation prior to software deployment.
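The abstract's core loop, combining beam search with a simulated-annealing acceptance rule and an elitism strategy, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`beam_annealing_search`, `mutate`, `fitness`) and all hyperparameters are hypothetical, and the fitness function stands in for the paper's text consistency metric.

```python
import math
import random

def beam_annealing_search(seed, mutate, fitness, beam_width=3,
                          temp=1.0, cooling=0.9, iters=20, rng=None):
    """Sketch of a beam-annealing fuzzing loop: beam search expands several
    candidate mutations per round, and a simulated-annealing acceptance rule
    lets occasionally-worse candidates survive while the temperature is high."""
    rng = rng or random.Random(0)
    beam = [seed]
    best, best_score = seed, fitness(seed)
    for _ in range(iters):
        candidates = []
        for cand in beam:
            for _ in range(beam_width):
                new = mutate(cand, rng)
                delta = fitness(new) - fitness(cand)
                # Annealing acceptance: always keep improvements; keep
                # regressions with probability exp(delta / temp).
                if delta >= 0 or rng.random() < math.exp(delta / temp):
                    candidates.append(new)
        if not candidates:
            break
        candidates.sort(key=fitness, reverse=True)
        beam = candidates[:beam_width]   # elitism: only top-k survive the round
        if fitness(beam[0]) > best_score:
            best, best_score = beam[0], fitness(beam[0])
        temp *= cooling                  # cool the temperature each round
    return best, best_score
```

In a fuzzing setting, `mutate` would apply prompt- or example-level perturbations and `fitness` would measure how far the software's output drifts from the reference; the cooling schedule shifts the loop from exploration to exploitation over time.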
Related papers
- Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model [98.35868970993232]
Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm. We introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber) to achieve better inference speed and output quality in code generation.
arXiv Detail & Related papers (2025-10-20T23:38:12Z) - LLMs are All You Need? Improving Fuzz Testing for MOJO with Large Language Models [7.171282546185869]
Large language models (LLMs) have revolutionized software testing, particularly fuzz testing, by automating the generation of diverse and effective test inputs. Mojo is a high-performance AI programming language blending Python's usability with the efficiency of C and C++. MojoFuzzer is the first adaptive LLM-based fuzzing framework designed for zero-shot learning environments of emerging programming languages.
arXiv Detail & Related papers (2025-10-11T11:37:18Z) - SciML Agents: Write the Solver, Not the Solution [69.5021018644143]
We introduce two new datasets: a diagnostic dataset of adversarial "misleading" problems; and a large-scale benchmark of 1,000 diverse ODE tasks. We evaluate open- and closed-source LLM models along two axes: (i) unguided versus guided prompting with domain-specific knowledge; and (ii) off-the-shelf versus fine-tuned variants. Preliminary results indicate that careful prompting and fine-tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.
arXiv Detail & Related papers (2025-09-12T02:53:57Z) - QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation [51.393569044134445]
Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification. Extending RLVR to the automatic generation of hardware description languages (HDLs) such as Verilog from natural-language (NL) specifications, however, poses three key challenges. We introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs.
arXiv Detail & Related papers (2025-05-30T03:51:06Z) - ABFS: Natural Robustness Testing for LLM-based NLP Software [8.833542944724465]
The use of Large Language Models (LLMs) in Natural Language Processing (NLP) software has rapidly gained traction across various domains. These applications frequently exhibit robustness deficiencies, where slight perturbations in input may lead to erroneous outputs. Current robustness testing methods face two main limitations: (1) low testing effectiveness, and (2) insufficient naturalness of test cases.
arXiv Detail & Related papers (2025-03-03T09:02:06Z) - Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models [76.17975723711886]
Uncertainty quantification (UQ) is a prominent approach for eliciting truthful answers from large language models (LLMs). In this work, we adapt Mahalanobis Distance (MD), a well-established UQ technique in classification tasks, for text generation. Our method extracts token embeddings from multiple layers of LLMs, computes MD scores for each token, and uses linear regression trained on these features to provide robust uncertainty scores.
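The per-token Mahalanobis scoring that this summary describes can be sketched in a few lines. This is an illustrative sketch under assumed inputs, not the paper's pipeline: `mahalanobis_scores` is a hypothetical name, the embeddings are plain arrays rather than real LLM hidden states, and the paper's layer aggregation and regression head are omitted.

```python
import numpy as np

def mahalanobis_scores(token_embs, train_embs, eps=1e-6):
    """Sketch of token-level Mahalanobis-distance uncertainty: fit a Gaussian
    (mean + covariance) to in-distribution token embeddings, then score each
    new token embedding by its distance from that distribution. Larger scores
    flag tokens the model's training distribution explains poorly."""
    mu = train_embs.mean(axis=0)
    # Regularize the covariance so the inverse is well-defined.
    cov = np.cov(train_embs, rowvar=False) + eps * np.eye(train_embs.shape[1])
    inv = np.linalg.inv(cov)
    diffs = token_embs - mu
    # Squared Mahalanobis distance per token:
    # d_i = (x_i - mu)^T Sigma^{-1} (x_i - mu)
    return np.einsum("ij,jk,ik->i", diffs, inv, diffs)
```

In the paper's setting, these per-token scores (computed at several layers) would become features for a lightweight regressor that outputs the final uncertainty estimate.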
arXiv Detail & Related papers (2025-02-20T10:25:13Z) - Assessing the Robustness of LLM-based NLP Software via Automated Testing [6.986328098563149]
This paper introduces the AutOmated Robustness Testing frAmework (AORTA), which reconceptualizes the testing process as a computational optimization problem. We propose Adaptive Beam Search (ABS), a novel testing method for LLM-based software within AORTA. ABS is tailored for the expansive feature space of LLMs and improves testing effectiveness through an adaptive beam width and the capability for backtracking.
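The adaptive-width-plus-backtracking idea can be made concrete with a small sketch. This is a hypothetical illustration of the general technique, not ABS itself: the adaptation rule (halve on progress, double on stall) and the names `adaptive_beam_search`, `expand`, and `score` are assumptions.

```python
def adaptive_beam_search(seed, expand, score, max_width=8, min_width=1, steps=10):
    """Sketch of adaptive beam search with backtracking: narrow the beam while
    the search keeps improving, widen it when progress stalls, and fall back
    to the previous frontier instead of committing to a stalled round."""
    width = min_width
    frontier = [seed]
    history = [frontier]
    best = max(frontier, key=score)
    for _ in range(steps):
        children = [c for node in frontier for c in expand(node)]
        if not children:
            break
        children.sort(key=score, reverse=True)
        frontier = children[:width]
        if score(frontier[0]) > score(best):
            best = frontier[0]
            width = max(min_width, width - 1)   # narrow while improving
        else:
            width = min(max_width, width * 2)   # widen when stalled
            frontier = history[-1]              # backtrack to the prior frontier
        history.append(frontier)
    return best
```

For robustness testing, `expand` would generate perturbed test inputs and `score` would measure the target model's deviation, so widening on stalls spends budget only where the search space is hard.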
arXiv Detail & Related papers (2024-12-30T15:33:34Z) - Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z) - RITFIS: Robust input testing framework for LLMs-based intelligent
software [6.439196068684973]
RITFIS is the first framework designed to assess the robustness of intelligent software against natural language inputs.
RITFIS adapts 17 automated testing methods, originally designed for Deep Neural Network (DNN)-based intelligent software.
Empirical validation demonstrates the effectiveness of RITFIS in evaluating LLM-based intelligent software.
arXiv Detail & Related papers (2024-02-21T04:00:54Z) - LEAP: Efficient and Automated Test Method for NLP Software [6.439196068684973]
This paper proposes LEAP, an automated test method that uses LEvy flight-based Adaptive Particle swarm optimization integrated with textual features to generate adversarial test cases.
We conducted a series of experiments to validate LEAP's ability to test NLP software and found that the average success rate of LEAP in generating adversarial test cases is 79.1%.
While ensuring high success rates, LEAP significantly reduces time overhead by up to 147.6s compared to other inertial-based methods.
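The Lévy-flight ingredient that distinguishes LEAP's particle swarm optimization can be sketched as follows. This is an illustrative variant under stated assumptions, not LEAP's exact update rule: the function names are hypothetical, and Mantegna's algorithm is a standard way to draw Lévy-stable steps that such methods commonly use.

```python
import math
import random

def levy_step(rng, beta=1.5):
    """Draw one Levy-flight step via Mantegna's algorithm: a heavy-tailed
    distribution mixing many small local moves with occasional long hops,
    useful for escaping local optima during adversarial search."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2) /
             (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1)
    return u / abs(v) ** (1 / beta)

def levy_pso_update(pos, vel, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5):
    """One particle-swarm velocity/position update whose position step is
    scaled by a Levy-flight draw instead of a plain unit step (an
    illustrative hybrid, not LEAP's exact formulation)."""
    new_vel = [w * v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
               for v, x, pb, gb in zip(vel, pos, pbest, gbest)]
    step = levy_step(rng)
    new_pos = [x + step * v for x, v in zip(pos, new_vel)]
    return new_pos, new_vel
```

Here `w` is the inertia weight and `c1`/`c2` the cognitive and social coefficients of standard PSO; the Lévy scaling occasionally throws a particle far from its current region, which is the escape mechanism such methods rely on.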
arXiv Detail & Related papers (2023-08-22T08:51:10Z) - LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.