Towards Reliable LLM-Driven Fuzz Testing: Vision and Road Ahead
- URL: http://arxiv.org/abs/2503.00795v1
- Date: Sun, 02 Mar 2025 08:46:39 GMT
- Title: Towards Reliable LLM-Driven Fuzz Testing: Vision and Road Ahead
- Authors: Yiran Cheng, Hong Jin Kang, Lwin Khin Shar, Chaopeng Dong, Zhiqiang Shi, Shichao Lv, Limin Sun
- Abstract summary: Large Language Models (LLMs) offer transformative potential for automating fuzz testing (LLM4Fuzz). This paper examines the reliability bottlenecks of LLM-driven fuzzing and explores potential research directions to address these limitations.
- Score: 7.059490893549601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fuzz testing is a crucial component of software security assessment, yet its effectiveness heavily relies on valid fuzz drivers and diverse seed inputs. Recent advancements in Large Language Models (LLMs) offer transformative potential for automating fuzz testing (LLM4Fuzz), particularly in generating drivers and seeds. However, current LLM4Fuzz solutions face critical reliability challenges, including low driver validity rates and seed quality trade-offs, hindering their practical adoption. This paper examines the reliability bottlenecks of LLM-driven fuzzing and explores potential research directions to address these limitations. It begins with an overview of the current development of LLM4SE and emphasizes the necessity of developing reliable LLM4Fuzz solutions. It then envisions a future in which reliable LLM4Fuzz transforms the landscape of software testing and security for industry, software development practitioners, and economic accessibility. Finally, it outlines a road ahead for future research, identifying key challenges and offering specific suggestions for researchers to consider. This work strives to spark innovation in the field, positioning reliable LLM4Fuzz as a fundamental component of modern software testing.
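To make the notion of a fuzz driver concrete, below is a minimal sketch of the kind of harness that LLM4Fuzz approaches aim to generate automatically. It is an illustration only, not code from the paper: it uses the Atheris fuzzer for Python, and `parse_config` is a hypothetical stand-in for the real library API a generated driver would exercise.

```python
# Minimal sketch of a fuzz driver (illustrative only, not from the paper).
# Coverage instrumentation is omitted for brevity.
import sys
import atheris


def parse_config(text: str) -> None:
    # Hypothetical placeholder for the library function under test.
    if text.startswith("[") and not text.endswith("]"):
        raise ValueError("unterminated section header")


def TestOneInput(data: bytes) -> None:
    # Decode raw fuzzer bytes into the structured input the target expects;
    # invalid UTF-8 is skipped rather than reported as a crash.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return
    try:
        parse_config(text)
    except ValueError:
        # Documented, expected error; only unexpected exceptions count as bugs.
        pass


if __name__ == "__main__":
    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()
```

A "valid" driver in the paper's sense is one that compiles or loads, calls the target API correctly, and lets the fuzzer make progress rather than crashing on its own misuse of the API.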
Related papers
- From Code to Courtroom: LLMs as the New Software Judges [29.77858458399232]
Large Language Models (LLMs) have been increasingly used to automate software engineering tasks such as code generation and summarization.
Human evaluation, while effective, is very costly and time-consuming.
The LLM-as-a-Judge paradigm, which employs LLMs for automated evaluation, has emerged.
arXiv Detail & Related papers (2025-03-04T03:48:23Z) - Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives [56.528835143531694]
We introduce DriveBench, a benchmark dataset designed to evaluate Vision-Language Models (VLMs). Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding. We propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding.
arXiv Detail & Related papers (2025-01-07T18:59:55Z) - Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators.
It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts.
We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
arXiv Detail & Related papers (2024-11-29T12:21:15Z) - AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Agent-Driven Automatic Software Improvement [55.2480439325792]
This research proposal aims to explore innovative solutions by focusing on the deployment of agents powered by Large Language Models (LLMs)
The iterative nature of agents, which allows for continuous learning and adaptation, can help surpass common challenges in code generation.
We aim to use the iterative feedback in these systems to further fine-tune the LLMs underlying the agents, becoming better aligned to the task of automated software improvement.
arXiv Detail & Related papers (2024-06-24T15:45:22Z) - MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models [51.19622266249408]
MultiTrust is the first comprehensive and unified benchmark on the trustworthiness of MLLMs. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks.
arXiv Detail & Related papers (2024-06-11T08:38:13Z) - Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study [72.24266814625685]
We explore the performance of large language models (LLMs) across the entire software development lifecycle with DevEval. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval.
arXiv Detail & Related papers (2024-03-13T15:13:44Z) - Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics [54.57914943017522]
We highlight the critical issues of robustness and safety associated with integrating large language models (LLMs) and vision-language models (VLMs) into robotics applications.
arXiv Detail & Related papers (2024-02-15T22:01:45Z) - Large Language Models Based Fuzzing Techniques: A Survey [4.155653485098873]
Fuzz testing, as an efficient software testing method, is widely used in various domains.
The rapid development of Large Language Models (LLMs) has facilitated their application in the field of software testing.
There is a growing trend towards employing fuzz tests generated by large language models.
arXiv Detail & Related papers (2024-02-01T05:34:03Z) - LLM4Fuzz: Guided Fuzzing of Smart Contracts with Large Language Models [7.833199151422389]
This paper introduces LLM4Fuzz to optimize automated smart contract security analysis.
It uses large language models (LLMs) to intelligently guide and prioritize fuzzing campaigns.
Evaluations show substantial gains in efficiency, coverage, and vulnerability detection.
arXiv Detail & Related papers (2024-01-20T04:07:53Z) - How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation [31.77886516971502]
This study is the first in-depth investigation of the key issues in using LLMs to generate effective fuzz drivers.
Our study evaluated 736,430 generated fuzz drivers, consuming 0.85 billion tokens (over $8,000 in charges).
Our insights have been implemented to improve the OSS-Fuzz-Gen project, facilitating practical fuzz driver generation in industry; a minimal sketch of the kind of driver validity check such work depends on appears after this list.
arXiv Detail & Related papers (2023-07-24T01:49:05Z) - Software Testing with Large Language Models: Survey, Landscape, and Vision [32.34617250991638]
Pre-trained large language models (LLMs) have emerged as a breakthrough technology in natural language processing and artificial intelligence.
This paper provides a comprehensive review of the utilization of LLMs in software testing.
arXiv Detail & Related papers (2023-07-14T08:26:12Z)
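As referenced in the fuzz driver generation entry above, the sketch below illustrates one plausible form of a validity check for LLM-generated fuzz drivers: the candidate driver must compile against the target library and survive a short smoke-test run before it is counted as usable. This is an illustration of the general approach, not code from any of the papers listed; the compiler flags are standard libFuzzer options, while the `-ltarget` link flag and file names are hypothetical.

```python
# Sketch of a validity check for LLM-generated fuzz drivers (illustrative only).
import subprocess
import tempfile
from pathlib import Path


def driver_is_valid(driver_source: str, run_seconds: int = 60) -> bool:
    """Compile a candidate libFuzzer driver and give it a short smoke-test run."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "driver.c"      # hypothetical file layout
        binary = Path(tmp) / "driver"
        src.write_text(driver_source)

        # Step 1: the generated driver must compile and link against the
        # target library ("-ltarget" is a placeholder for the real library).
        compile_cmd = ["clang", "-fsanitize=fuzzer,address",
                       str(src), "-ltarget", "-o", str(binary)]
        if subprocess.run(compile_cmd, capture_output=True).returncode != 0:
            return False

        # Step 2: a brief fuzzing run must not crash immediately; an instant
        # crash usually means the driver itself misuses the API rather than
        # revealing a real bug in the target.
        run = subprocess.run([str(binary), f"-max_total_time={run_seconds}"],
                             capture_output=True)
        return run.returncode == 0
```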