Automatic Static Bug Detection for Machine Learning Libraries: Are We
There Yet?
- URL: http://arxiv.org/abs/2307.04080v1
- Date: Sun, 9 Jul 2023 01:38:52 GMT
- Title: Automatic Static Bug Detection for Machine Learning Libraries: Are We
There Yet?
- Authors: Nima Shiri Harzevili, Jiho Shin, Junjie Wang, Song Wang, Nachiappan
Nagappan
- Abstract summary: We analyze five popular and widely used static bug detectors, i.e., Flawfinder, RATS, Cppcheck, Facebook Infer, and Clang, on a curated dataset of software bugs.
Overall, our study shows that static bug detectors find a negligible fraction of all bugs, detecting only 6 of the 410 bugs (about 1.5%); Flawfinder and RATS are the most effective static checkers for finding software bugs in machine learning libraries.
- Score: 14.917820383894124
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic detection of software bugs is a critical task in software security.
Many static tools that can help detect bugs have been proposed. However, these
static bug detectors have mainly been evaluated on general software projects,
which calls into question their practical effectiveness and usefulness for
machine learning libraries. In this paper, we address this question by analyzing
five popular and widely used static bug detectors, i.e., Flawfinder, RATS,
Cppcheck, Facebook Infer, and the Clang static analyzer, on a curated dataset of
software bugs gathered from four popular machine learning libraries, namely
Mlpack, MXNet, PyTorch, and TensorFlow, with a total of 410 known bugs. Our
research provides a
categorization of these tools' capabilities to better understand the strengths
and weaknesses of the tools for detecting software bugs in machine learning
libraries. Overall, our study shows that static bug detectors find a negligible
fraction of all bugs, detecting only 6 of the 410 bugs (about 1.5%); Flawfinder
and RATS are the most effective static checkers for finding software bugs in
machine learning
libraries. Based on our observations, we further identify and discuss
opportunities to make the tools more effective and practical.
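For illustration, the sketch below shows one way pattern-based checkers such as Flawfinder and Cppcheck can be driven over a single C++ source file. It is a minimal Python sketch, not the authors' evaluation pipeline: the target file name is hypothetical, and it assumes the flawfinder and cppcheck command-line tools are installed.
```python
# Minimal sketch: run two static checkers over a C++ source file and print
# their raw findings. The target file is hypothetical; the actual study used
# curated buggy versions of Mlpack, MXNet, PyTorch, and TensorFlow, not this
# script.
import subprocess

TARGET = "example_kernel.cpp"  # hypothetical file suspected to contain a bug

def run(cmd):
    """Run a checker and return its combined output (findings may go to stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return (result.stdout + result.stderr).strip()

if __name__ == "__main__":
    print("== flawfinder ==")
    print(run(["flawfinder", TARGET]))
    print("== cppcheck ==")
    print(run(["cppcheck", "--enable=warning", TARGET]))
```
Flawfinder and RATS mostly match known risky C/C++ API patterns lexically, while Cppcheck, Infer, and the Clang static analyzer perform deeper dataflow analysis; the study's categorization compares how these differing designs fare on real machine learning library bugs.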
Related papers
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- Software issues report for bug fixing process: An empirical study of machine-learning libraries [0.0]
We investigated the effectiveness of issue resolution for bug-fixing processes in six machine-learning libraries.
The most common categories of issues that arise in machine-learning libraries are bugs, documentation, optimization, crashes, enhancement, new feature requests, build/CI, support, and performance.
This study concludes that efficient issue-tracking processes, effective communication, and collaboration are vital for effective resolution of issues and bug fixing processes in machine-learning libraries.
arXiv Detail & Related papers (2023-12-10T21:33:19Z)
- An Empirical Study on Bugs Inside PyTorch: A Replication Study [10.848682558737494]
We characterize bugs in the PyTorch library, a very popular deep learning framework.
Our results highlight that PyTorch bugs are more like bugs in traditional software projects than bugs tied to deep learning characteristics.
arXiv Detail & Related papers (2023-07-25T19:23:55Z)
- A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate the functional correctness of model-generated code on simple programming problems.
However, static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
- Infrared: A Meta Bug Detector [10.541969253100815]
We propose a new approach, called meta bug detection, which offers three crucial advantages over existing learning-based bug detectors.
Our evaluation shows our meta bug detector (MBD) is effective in catching a variety of bugs including null pointer dereference, array index out-of-bound, file handle leak, and even data races in concurrent programs.
arXiv Detail & Related papers (2022-09-18T09:08:51Z)
- ALBench: A Framework for Evaluating Active Learning in Object Detection [102.81795062493536]
This paper contributes an active learning benchmark framework named ALBench for evaluating active learning in object detection.
Developed on an automatic deep model training system, the ALBench framework is easy to use, compatible with different active learning algorithms, and ensures the same training and testing protocols.
arXiv Detail & Related papers (2022-07-27T07:46:23Z)
- BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving automated program repair (APR) performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
- Self-Supervised Bug Detection and Repair [27.46717890823656]
We present BugLab, an approach for self-supervised learning of bug detection and repair.
A Python implementation of BugLab improves by up to 30% upon baseline methods on a test dataset of 2374 real-life bugs.
arXiv Detail & Related papers (2021-05-26T18:41:05Z)
- D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools.
We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)
- Smoke Testing for Machine Learning: Simple Tests to Discover Severe Defects [7.081604594416339]
We try to determine generic and simple smoke tests that can be used to assert that basic functions can be executed without crashing.
We were able to find bugs in all three machine learning libraries that we tested and severe bugs in two of the three libraries.
arXiv Detail & Related papers (2020-09-03T08:54:43Z)
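To make the smoke-testing idea concrete, below is a minimal sketch of such a test in Python. The library under test (scikit-learn) and the toy data are illustrative assumptions only; the paper applies its tests to other machine learning libraries, and this is not the authors' test suite.
```python
# Minimal sketch of a smoke test in the spirit of the paper: exercise a basic
# train/predict path and only assert that it completes without crashing.
import numpy as np
from sklearn.linear_model import LogisticRegression

def test_fit_predict_does_not_crash():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))      # tiny random feature matrix
    y = np.array([0, 1] * 10)         # two balanced classes
    model = LogisticRegression(max_iter=200)
    model.fit(X, y)                   # must not raise
    preds = model.predict(X)          # must not raise
    assert preds.shape == (20,)       # sanity check on output shape only

if __name__ == "__main__":
    test_fit_predict_does_not_crash()
    print("smoke test passed: fit/predict ran without crashing")
```
Such a test asserts nothing about predictive quality; it only checks that basic entry points run to completion on simple inputs, which is the kind of check the paper uses to surface crashing defects.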