May the Feedback Be with You! Unlocking the Power of Feedback-Driven Deep Learning Framework Fuzzing via LLMs
- URL: http://arxiv.org/abs/2506.17642v1
- Date: Sat, 21 Jun 2025 08:51:53 GMT
- Title: May the Feedback Be with You! Unlocking the Power of Feedback-Driven Deep Learning Framework Fuzzing via LLMs
- Authors: Shaoyu Yang, Chunrong Fang, Haifeng Lin, Xiang Chen, Zhenyu Chen,
- Abstract summary: A simple yet effective way to find bugs in Deep Learning (DL) frameworks is fuzz testing (Fuzzing). We propose FUEL to break the seal of feedback-driven fuzzing for DL frameworks. FUEL has detected 104 bugs for PyTorch and TensorFlow, with 93 confirmed as new bugs, 47 already fixed, and 5 assigned CVE IDs.
- Score: 13.976286931563006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Artificial Intelligence (AI) Infrastructures, represented by Deep Learning (DL) frameworks, have served as fundamental DL systems over the last decade. However, the bugs in DL frameworks could lead to catastrophic consequences in some critical scenarios (e.g., healthcare and autonomous driving). A simple yet effective way to find bugs in DL frameworks is fuzz testing (Fuzzing). Unfortunately, existing fuzzing techniques have not comprehensively considered multiple types of feedback. Additionally, they analyze feedback in a coarse-grained manner, such as mutating the test cases only according to whether the coverage increases. Recently, researchers introduced Large Language Models (LLMs) into fuzzing. However, current LLM-based fuzzing techniques only focus on using LLMs to generate test cases while overlooking their potential to analyze feedback information, failing to create more valid and diverse test cases. To fill this gap, we propose FUEL to break the seal of feedback-driven fuzzing for DL frameworks. The backbone of FUEL comprises two LLM-based agents, namely the analysis LLM and the generation LLM. The analysis LLM agent infers analysis summaries from feedback information, while the generation LLM agent creates test cases guided by these analysis summaries. So far, FUEL has detected 104 bugs for PyTorch and TensorFlow, with 93 confirmed as new bugs, 47 already fixed, and 5 assigned CVE IDs. Our work indicates that considering multiple types of feedback is beneficial to fuzzing performance, and leveraging LLMs to analyze feedback information is a promising direction. Our artifact is available at https://github.com/NJU-iSE/FUEL
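The abstract's two-agent loop (an analysis agent that digests runtime feedback and a generation agent that writes the next test case) can be pictured with a minimal, hypothetical sketch. This is not FUEL's actual implementation (see the linked artifact for that); `query_llm`, the prompts, and the PyTorch seed program below are assumptions introduced here purely for illustration.

```python
import subprocess
import sys
import tempfile
import textwrap

def query_llm(role: str, prompt: str) -> str:
    """Placeholder for an LLM call; plug in any chat-completion client here."""
    raise NotImplementedError("supply a concrete LLM client")

def run_test_case(code: str) -> dict:
    """Run a generated test program and collect execution feedback."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=60)
        return {"returncode": proc.returncode, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"returncode": None, "stderr": "timeout (possible hang)"}

def fuzz_loop(seed_program: str, iterations: int = 10) -> None:
    program = seed_program
    for _ in range(iterations):
        feedback = run_test_case(program)
        # Analysis agent: condense raw feedback into an analysis summary.
        summary = query_llm(
            role="analysis",
            prompt="Summarize this execution feedback and suggest what to try next:\n"
                   f"return code: {feedback['returncode']}\nstderr:\n{feedback['stderr']}",
        )
        # Generation agent: write the next test case guided by the summary.
        program = query_llm(
            role="generation",
            prompt=f"Write a new PyTorch test program guided by this analysis:\n{summary}",
        )

seed = textwrap.dedent("""
    import torch
    x = torch.randn(3, 3)
    print(torch.linalg.inv(x))
""")
# fuzz_loop(seed)  # requires a concrete query_llm implementation
```

The point of the sketch is the conditioning step: the generation agent sees an explicit analysis summary of the observed feedback (crash output, error messages, and similar signals) rather than a single coarse coverage bit.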
Related papers
- Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback [11.070932612938154]
We develop an Engineering Agent that fixes the source code based on test failures at scale across diverse software offerings. We provide feedback to the agent through static analysis and test failures so it can refine its solution. In a three-month period, 80% of the generated fixes were reviewed, of which 31.5% were landed.
arXiv Detail & Related papers (2025-07-24T19:12:32Z) - Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets. Frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z) - Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695]
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities such as buffer overflows and use-after-free errors. Traditional fuzzing struggles with the complexity and API diversity of DL libraries. We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
arXiv Detail & Related papers (2025-01-08T07:07:22Z) - Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning [68.57166425493283]
Refusal-Aware Instruction Tuning (RAIT) enables Large Language Models (LLMs) to refuse to answer unknown questions. However, this crude approach can cause LLMs to excessively refuse questions they could have answered correctly. We introduce Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning (CRaFT) to address this issue.
arXiv Detail & Related papers (2024-10-09T14:12:51Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - LLAMAFUZZ: Large Language Model Enhanced Greybox Fuzzing [6.042114639413868]
Specialized fuzzers can handle complex structured data, but they require additional effort to specify grammars and suffer from low throughput.
In this paper, we explore the potential of utilizing the Large Language Model to enhance greybox fuzzing for structured data.
Our LLM-based fuzzer, LLAMAFUZZ, integrates the power of LLMs to understand and mutate structured data for fuzzing.
arXiv Detail & Related papers (2024-06-11T20:48:28Z) - Get my drift? Catching LLM Task Drift with Activation Deltas [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users. We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set. We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
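The "activation deltas" signal mentioned in the entry above lends itself to a tiny illustration (not taken from that paper): take the difference between hidden-state vectors captured before and after the model reads external text, then fit an ordinary linear classifier on those difference vectors. The data below is synthetic and the feature extraction is a stand-in assumption; the sketch only shows the shape of the approach.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
dim, n = 64, 500

# Synthetic "activation deltas": hidden state after reading external text
# minus hidden state before (real features would come from the LLM itself).
clean_deltas = rng.normal(0.0, 1.0, size=(n, dim))   # no task drift
drift_deltas = rng.normal(0.5, 1.0, size=(n, dim))   # injected instructions shift activations

X = np.vstack([clean_deltas, drift_deltas])
y = np.concatenate([np.zeros(n), np.ones(n)])

# A plain linear (logistic-regression) probe over the deltas.
clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]
print("in-sample ROC AUC:", roc_auc_score(y, scores))
```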
arXiv Detail & Related papers (2024-06-02T16:53:21Z) - LLMDFA: Analyzing Dataflow in Code with Large Language Models [8.92611389987991]
This paper presents LLMDFA, a compilation-free and customizable dataflow analysis framework.
We decompose the problem into several subtasks and introduce a series of novel strategies.
On average, LLMDFA achieves 87.10% precision and 80.77% recall, surpassing existing techniques with F1 score improvements of up to 0.35.
arXiv Detail & Related papers (2024-02-16T15:21:35Z) - Feedback Loops With Language Models Drive In-Context Reward Hacking [78.9830398771605]
We show that feedback loops can cause in-context reward hacking (ICRH).
We identify and study two processes that lead to ICRH: output-refinement and policy-refinement.
As AI development accelerates, the effects of feedback loops will proliferate.
arXiv Detail & Related papers (2024-02-09T18:59:29Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models [18.026567399243]
Large Language Models (LLMs) offer a promising alternative to static analysis.
In this paper, we take a deep dive into the open space of LLM-assisted static analysis.
We develop LLift, a fully automated framework that interfaces with both a static analysis tool and an LLM.
arXiv Detail & Related papers (2023-08-01T02:57:43Z) - Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers?
We propose KaRR, a statistical approach to assess factual knowledge for LLMs.
Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
arXiv Detail & Related papers (2023-05-17T18:54:37Z)