Related papers: CITADEL: Context Similarity Based Deep Learning Framework Bug Finding

URL: http://arxiv.org/abs/2406.12196v2
Date: Wed, 19 Jun 2024 01:46:25 GMT
Title: CITADEL: Context Similarity Based Deep Learning Framework Bug Finding
Authors: Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Shiwei Wang, Chao Shen
Abstract summary: We propose CITADEL, a method that accelerates the finding of bugs in terms of efficiency and effectiveness. It works by first collecting existing bug reports and identifying problematic APIs. A remarkable 35.40% of the test cases generated by CITADEL trigger bugs, far exceeding the ratios of 0.74%, 1.23%, and 3.90% exhibited by DocTer, DeepREL, and TitanFuzz.
Score: 36.34154201748415
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With deep learning (DL) technology becoming an integral part of the new intelligent software, tools for DL framework testing and bug finding are in high demand. Existing DL framework testing tools have limited coverage of bug types. For example, they lack the capability of finding performance bugs, which are critical for DL model training and inference regarding performance, economics, and the environment. This problem is challenging due to the difficulty of getting test oracles for performance bugs. Moreover, existing tools are inefficient, generating hundreds of test cases of which few trigger bugs. In this paper, we propose CITADEL, a method that accelerates the finding of bugs in terms of efficiency and effectiveness. We observe that many DL framework bugs are similar due to the similarity of operators and algorithms belonging to the same family (e.g., Conv2D and Conv3D). Orthogonal to existing bug-finding tools, CITADEL aims to find new bugs that are similar to reported ones that have known test oracles. It works by first collecting existing bug reports and identifying problematic APIs. CITADEL defines context similarity to measure the similarity of DL framework API pairs and automatically generates test cases with oracles for APIs that are similar to the problematic APIs in existing bug reports. CITADEL covers 1,436 PyTorch and 5,380 TensorFlow APIs and detects 79 and 80 API bugs, respectively, among which 58 and 68 are new, and 36 and 58 have been confirmed. Many of these, e.g., the 11 performance bugs, cannot be detected by existing tools. Moreover, a remarkable 35.40% of the test cases generated by CITADEL trigger bugs, far exceeding the ratios of 0.74%, 1.23%, and 3.90% exhibited by the state-of-the-art methods DocTer, DeepREL, and TitanFuzz.
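
Illustrative sketch (not from the paper): the abstract describes starting from a problematic API in a bug report and looking for similar APIs to test. The abstract does not define context similarity, so the snippet below substitutes a plain Jaccard similarity over parameter names as a stand-in. The API table, the `similar_apis` helper, and the 0.5 threshold are all hypothetical.

```python
# Hypothetical parameter sets for a few DL framework APIs.
# These are illustrative assumptions, not data from the paper.
API_PARAMS = {
    "Conv2D": {"filters", "kernel_size", "strides", "padding", "dilation_rate"},
    "Conv3D": {"filters", "kernel_size", "strides", "padding", "dilation_rate"},
    "MaxPool2D": {"pool_size", "strides", "padding"},
    "Dense": {"units", "activation", "use_bias"},
}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_apis(problematic, threshold=0.5):
    """Return (api, score) pairs for APIs whose parameter sets resemble
    those of the problematic API, sorted by descending score.

    This is a stand-in measure, not CITADEL's context similarity.
    """
    target = API_PARAMS[problematic]
    scored = [
        (name, jaccard(target, params))
        for name, params in API_PARAMS.items()
        if name != problematic
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# A bug reported against Conv2D suggests testing its closest relatives.
print(similar_apis("Conv2D"))
```

Under these assumed signatures, only Conv3D passes the threshold for Conv2D, which mirrors the same-family example (Conv2D and Conv3D) given in the abstract.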

Related papers

BugScope: Learn to Find Bugs Like Human [9.05553442116139]
BugScope emulates how human auditors learn new bug patterns from representative examples and apply that knowledge during code auditing. Our evaluation on a dataset of 40 real-world bugs drawn from 21 widely-used open-source projects demonstrates that BugScope achieves 87.04% precision. Further testing on large-scale open-source systems, including the Linux kernel, uncovered 141 previously unknown bugs.
arXiv Detail & Related papers (2025-07-21T14:34:01Z)
CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695]
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities like buffer overflows and use-after-free errors. Traditional fuzzing struggles with the complexity and API diversity of DL libraries. We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
arXiv Detail & Related papers (2025-01-08T07:07:22Z)
Subgraph-Oriented Testing for Deep Learning Libraries [9.78188667672054]
We propose SORT (Subgraph-Oriented Realistic Testing) to test Deep Learning (DL) libraries on different hardware platforms. SORT takes popular API interaction patterns, represented as frequent subgraphs of model graphs, as test subjects. SORT achieves a 100% valid input generation rate, detects more precision bugs than existing methods, and reveals interaction-related bugs missed by single-API testing.
arXiv Detail & Related papers (2024-12-09T12:10:48Z)
Leveraging Data Characteristics for Bug Localization in Deep Learning Programs [21.563130049562357]
We propose Theia, which detects and localizes structural bugs in Deep Learning (DL) programs. Our results show that Theia successfully localizes 57/75 structural bugs in 40 buggy programs, whereas NeuraLint, a state-of-the-art approach capable of localizing structural bugs before training, localizes 17/75 bugs.
arXiv Detail & Related papers (2024-12-08T01:52:06Z)
Reinforcement Learning-Based REST API Testing with Multi-Coverage [4.127886193201882]
MUCOREST is a novel Reinforcement Learning (RL)-based API testing approach that leverages Q-learning to maximize code coverage and output coverage. MUCOREST significantly outperforms state-of-the-art API testing approaches by 11.6-261.1% in the number of discovered API bugs.
arXiv Detail & Related papers (2024-10-20T14:20:23Z)
KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks. In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel. To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs). It covers four major bug categories and 18 minor types in C++, Java, and Python. We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
Automated Bug Generation in the era of Large Language Models [6.0770779409377775]
BugFarm transforms arbitrary code into multiple complex bugs. The paper reports a comprehensive evaluation of 435k+ bugs from over 1.9M mutants generated by BugFarm.
arXiv Detail & Related papers (2023-10-03T20:01:51Z)
PreciseBugCollector: Extensible, Executable and Precise Bug-fix Collection [8.79879909193717]
We introduce PreciseBugCollector, a precise, multi-language bug collection approach. It is based on two novel components: a bug tracker to map the repositories with external bug repositories to trace bug type information, and a bug injector to generate project-specific bugs. To date, PreciseBugCollector comprises 1,057,818 bugs extracted from 2,968 open-source projects.
arXiv Detail & Related papers (2023-09-12T13:47:44Z)
An Analysis of Bugs In Persistent Memory Application [0.0]
We evaluate an open-source automatic bug detection tool (i.e., AGAMOTTO) to test NVM level hashing PM applications. Our faithful validation tool was able to discover 65 new NVM level hashing bugs in the PMDK library. We propose a Deep-Q Learning search algorithm over the PM-Aware search algorithm to improve the efficiency of the search strategy.
arXiv Detail & Related papers (2023-07-19T23:12:01Z)
Prompting Is All You Need: Automated Android Bug Replay with Large Language Models [28.69675481931385]
We propose AdbGPT, a new lightweight approach to automatically reproduce bugs from bug reports through prompt engineering. AdbGPT leverages few-shot learning and chain-of-thought reasoning to elicit human knowledge and logical reasoning from LLMs. Our evaluations demonstrate the effectiveness and efficiency of AdbGPT, which reproduces 81.3% of bug reports in 253.6 seconds.
arXiv Detail & Related papers (2023-06-03T03:03:52Z)
Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers. We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z)
BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization. We provide a general benchmark with a diversity of real and synthetic Java bugs. We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools. We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.