MirrorFuzz: Leveraging LLM and Shared Bugs for Deep Learning Framework APIs Fuzzing
- URL: http://arxiv.org/abs/2510.15690v1
- Date: Fri, 17 Oct 2025 14:34:00 GMT
- Title: MirrorFuzz: Leveraging LLM and Shared Bugs for Deep Learning Framework APIs Fuzzing
- Authors: Shiwen Ou, Yuwei Li, Lu Yu, Chengkun Wei, Tingke Wen, Qiangpu Chen, Yu Chen, Haizhi Tang, Zulie Pan
- Abstract summary: Deep learning (DL) frameworks serve as the backbone for a wide range of artificial intelligence applications. Bugs within DL frameworks can cascade into critical issues in higher-level applications, jeopardizing reliability and security. We propose MirrorFuzz, an automated API fuzzing solution to discover shared bugs in DL frameworks.
- Score: 12.907848248343003
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Deep learning (DL) frameworks serve as the backbone for a wide range of artificial intelligence applications. However, bugs within DL frameworks can cascade into critical issues in higher-level applications, jeopardizing reliability and security. While numerous techniques have been proposed to detect bugs in DL frameworks, research exploring common API patterns across frameworks and the potential risks they entail remains limited. Notably, many DL frameworks expose similar APIs with overlapping input parameters and functionalities, rendering them vulnerable to shared bugs, where a flaw in one API may extend to analogous APIs in other frameworks. To address this challenge, we propose MirrorFuzz, an automated API fuzzing solution to discover shared bugs in DL frameworks. MirrorFuzz operates in three stages: First, MirrorFuzz collects historical bug data for each API within a DL framework to identify potentially buggy APIs. Second, it matches each buggy API in a specific framework with similar APIs within and across other DL frameworks. Third, it employs large language models (LLMs) to synthesize code for the API under test, leveraging the historical bug data of similar APIs to trigger analogous bugs across APIs. We implement MirrorFuzz and evaluate it on four popular DL frameworks (TensorFlow, PyTorch, OneFlow, and Jittor). Extensive evaluation demonstrates that MirrorFuzz improves code coverage by 39.92% and 98.20% compared to state-of-the-art methods on TensorFlow and PyTorch, respectively. Moreover, MirrorFuzz discovers 315 bugs, 262 of which are newly found; 80 bugs have been fixed, and 52 have been assigned CNVD IDs.
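The three-stage pipeline in the abstract is concrete enough to sketch. Below is a minimal, self-contained Python illustration of the idea, not the authors' implementation: BugRecord, match_similar_apis, and build_prompt are hypothetical names, and the lexical similarity heuristic is a stand-in for whatever matching the paper actually uses.

```python
# Illustrative sketch of the three stages described in the abstract.
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class BugRecord:
    framework: str  # e.g. "tensorflow"
    api: str        # fully qualified API name
    report: str     # historical bug description or PoC snippet

def name_similarity(a: str, b: str) -> float:
    """Crude lexical similarity between two API names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_similar_apis(buggy: BugRecord, candidates: list[str],
                       threshold: float = 0.6) -> list[str]:
    """Stage 2: pair a known-buggy API with lookalike APIs elsewhere."""
    short = buggy.api.split(".")[-1]
    return [c for c in candidates
            if name_similarity(short, c.split(".")[-1]) >= threshold]

def build_prompt(buggy: BugRecord, target_api: str) -> str:
    """Stage 3: ask an LLM for a test that may trigger an analogous bug."""
    return (f"The API {buggy.api} in {buggy.framework} had this bug:\n"
            f"{buggy.report}\n"
            f"Write a short Python program that calls {target_api} with inputs "
            f"likely to trigger a similar bug.")

# Stage 1 would mine issue trackers for historical bug data; we fake one record.
record = BugRecord("tensorflow", "tf.raw_ops.Conv2D",
                   "crash on zero-sized filter dimension")
for target in match_similar_apis(record, ["torch.nn.functional.conv2d",
                                          "torch.fft.fft"]):
    print(build_prompt(record, target))
```

In the paper's setting the synthesized programs would then be executed under a crash monitor to see whether the analogous bug fires; this sketch stops at prompt construction.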
Related papers
- LLM-Powered Silent Bug Fuzzing in Deep Learning Libraries via Versatile and Controlled Bug Transfer [15.118579443741659]
We build on the observation that historical bug reports contain rich, underutilized information about silent bugs. We leverage large language models (LLMs) to perform versatile yet controlled bug transfer for silent bug fuzzing. This enables proactive detection of silent bugs by transferring high-risk contexts and oracle designs from known buggy APIs to functionally similar targets.
arXiv Detail & Related papers (2026-02-26T14:53:26Z)
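Since silent bugs by definition do not crash, the oracle is the interesting part. The snippet below is a generic differential oracle of the kind such fuzzers rely on, comparing one computation at two precisions; it is an assumption for illustration, not this paper's oracle design.

```python
# Hedged sketch of a "silent bug" oracle: no exception is raised, so
# correctness is judged by comparing the same computation under two
# configurations (here, float32 vs. float64). Tolerance is an assumption.
import numpy as np

def silent_bug_oracle(fn, args, rtol=1e-4):
    ref = fn(*[a.astype(np.float64) for a in args])  # reference run
    out = fn(*[a.astype(np.float32) for a in args]).astype(np.float64)
    if not np.allclose(ref, out, rtol=rtol):
        return "possible silent bug: outputs diverge"
    return "ok"

x = np.linspace(0.0, 1.0, 8)
print(silent_bug_oracle(np.cumsum, (x,)))  # prints "ok" for this benign case
```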
- Framework-Aware Code Generation with API Knowledge Graph-Constructed Data: A Study on HarmonyOS [52.483888557864326]
APIKG4SYN is a framework designed to exploit API knowledge graphs for the construction of API-oriented question-code pairs. We build the first benchmark for HarmonyOS code generation using APIKG4SYN.
arXiv Detail & Related papers (2025-11-29T08:13:54Z)
arXiv Detail & Related papers (2025-11-29T08:13:54Z) - XAMT: Cross-Framework API Matching for Testing Deep Learning Libraries [6.00258665435786]
We propose XAMT, a cross-framework fuzzing method for testing deep learning libraries. XAMT matches APIs using similarity-based rules based on names, descriptions, and parameter structures. It then aligns inputs and applies variance-guided testing to detect bugs.
arXiv Detail & Related papers (2025-08-18T00:48:50Z)
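A minimal sketch of similarity-based API matching in the spirit of the summary above, scoring names, docstrings, and parameter counts. The weights and the 0-to-1 scoring are illustrative assumptions, not XAMT's actual rules.

```python
# Toy cross-framework API matcher: combine lexical similarity of names and
# docstrings with a parameter-count comparison. Weights are arbitrary.
import cmath
import inspect
import math
from difflib import SequenceMatcher

def api_match_score(fn_a, fn_b) -> float:
    name = SequenceMatcher(None, fn_a.__name__, fn_b.__name__).ratio()
    doc = SequenceMatcher(None, (fn_a.__doc__ or "").lower()[:200],
                          (fn_b.__doc__ or "").lower()[:200]).ratio()
    try:
        pa = len(inspect.signature(fn_a).parameters)
        pb = len(inspect.signature(fn_b).parameters)
        params = 1.0 - abs(pa - pb) / max(pa, pb, 1)
    except ValueError:  # C-implemented callables may expose no signature
        params = 0.0
    return 0.5 * name + 0.3 * doc + 0.2 * params

print(api_match_score(math.sqrt, cmath.sqrt))  # high: same name and purpose
print(api_match_score(math.sqrt, math.gamma))  # lower: different operations
```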
- SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs). Unlike traditional static benchmarks, SwingArena models the collaborative process of software development by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z)
- Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695]
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities such as buffer overflows and use-after-free errors. Traditional fuzzing struggles with the complexity and API diversity of DL libraries. We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
arXiv Detail & Related papers (2025-01-08T07:07:22Z)
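Whatever generates the test programs, a DL-library fuzzer must distinguish Python-level exceptions from native crashes such as segmentation faults. The harness below is a generic sketch of that monitoring step under stated assumptions (POSIX signal semantics), not DFUZZ's actual code.

```python
# Run an (LLM-)generated snippet in a child process and classify the outcome.
# Memory-safety bugs in native DL kernels surface as abnormal exits, not as
# ordinary Python exceptions. POSIX semantics assumed for signal detection.
import os
import subprocess
import sys
import tempfile

def run_generated_snippet(code: str, timeout: int = 20) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "hang"
    finally:
        os.unlink(path)
    if proc.returncode < 0:  # child killed by a signal, e.g. SIGSEGV
        return f"crash (signal {-proc.returncode})"
    if proc.returncode != 0:
        return "python-level exception"
    return "ok"

print(run_generated_snippet("print('benign test case')"))
```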
- ExploraCoder: Advancing code generation for multiple unseen APIs via planning and chained exploration [70.26807758443675]
ExploraCoder is a training-free framework that empowers large language models to invoke unseen APIs in code solutions. Experimental results demonstrate that ExploraCoder significantly improves performance for models lacking prior API knowledge.
arXiv Detail & Related papers (2024-12-06T19:00:15Z)
- Reinforcement Learning-Based REST API Testing with Multi-Coverage [4.127886193201882]
MUCOREST is a novel Reinforcement Learning (RL)-based API testing approach that leverages Q-learning to maximize code coverage and output coverage. MUCOREST significantly outperforms state-of-the-art API testing approaches by 11.6-261.1% in the number of discovered API bugs.
arXiv Detail & Related papers (2024-10-20T14:20:23Z)
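For readers unfamiliar with Q-learning in a testing loop, here is a toy skeleton consistent with the summary above: the agent picks which REST operation to exercise next and is rewarded only for previously unseen behavior. The state space, actions, and reward are invented stand-ins, not MUCOREST's design.

```python
# Tabular Q-learning over REST operations, rewarding only new (operation,
# status) behaviors as a crude proxy for code/output coverage. All specifics
# here (states, reward, endpoints) are illustrative assumptions.
import random
from collections import defaultdict

ACTIONS = ["GET /users", "POST /users", "GET /users/{id}", "DELETE /users/{id}"]
q = defaultdict(float)  # Q[(state, action)] -> value
seen = set()            # coverage proxy: distinct (operation, status) pairs
alpha, gamma, eps = 0.5, 0.9, 0.2

def call_api(action: str) -> tuple[str, float]:
    """Stub for issuing the request; a real tester would hit a live server."""
    status = random.choice(["2xx", "4xx", "5xx"])
    behavior = (action, status)
    reward = 1.0 if behavior not in seen else 0.0
    seen.add(behavior)
    return status, reward

state = "start"
for _ in range(200):
    # epsilon-greedy action selection
    action = (random.choice(ACTIONS) if random.random() < eps
              else max(ACTIONS, key=lambda a: q[(state, a)]))
    status, reward = call_api(action)
    best_next = max(q[(status, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    state = status

print(f"explored {len(seen)} distinct (operation, status) behaviors")
```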
- A Comprehensive Framework for Evaluating API-oriented Code Generation in Large Language Models [14.665460257371164]
Large language models (LLMs) like GitHub Copilot and ChatGPT have emerged as powerful tools for code generation.
We propose AutoAPIEval, a framework designed to evaluate the capabilities of LLMs in API-oriented code generation.
arXiv Detail & Related papers (2024-09-23T17:22:09Z)
- CITADEL: Context Similarity Based Deep Learning Framework Bug Finding [36.34154201748415]
Existing deep learning (DL) framework testing tools have limited coverage of bug types. We propose Citadel, a method that accelerates bug finding in terms of both efficiency and effectiveness.
arXiv Detail & Related papers (2024-06-18T01:51:16Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)