Toward Understanding Deep Learning Framework Bugs
- URL: http://arxiv.org/abs/2203.04026v4
- Date: Wed, 21 Aug 2024 06:52:04 GMT
- Title: Toward Understanding Deep Learning Framework Bugs
- Authors: Junjie Chen, Yihua Liang, Qingchao Shen, Jiajun Jiang, Shuochuan Li
- Abstract summary: We conduct a large-scale study on 1,000 bugs from four popular and diverse DL frameworks.
We obtain 12 major findings that provide a comprehensive understanding of DL framework bugs and of the current status of existing DL framework testing practice.
Based on the guidelines, we design and implement a prototype DL-framework testing tool, called TenFuzz.
- Score: 6.38591361654859
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: DL frameworks are the basis of all DL programs and models; their bugs can therefore cause unexpected behavior in any DL program or model that relies on them. Such wide impact demonstrates the necessity and importance of guaranteeing DL frameworks' quality. Understanding the characteristics of DL framework bugs is a fundamental step toward this quality-assurance task, as it facilitates the design of effective bug detection and debugging approaches. Hence, in this work we conduct the largest study to date, covering 1,000 bugs from four popular and diverse DL frameworks (i.e., TensorFlow, PyTorch, MXNet, and DL4J). By analyzing the root causes and symptoms of DL framework bugs associated with five components decomposed from DL frameworks, and by measuring the test coverage achieved by three state-of-the-art testing techniques, we obtain 12 major findings that provide a comprehensive understanding of DL framework bugs and of the current status of DL framework testing practice, and we then provide a series of actionable guidelines for better DL framework bug detection and debugging. Finally, based on the guidelines, we design and implement a prototype DL-framework testing tool, called TenFuzz, which is evaluated to be effective and finds 3 unknown bugs in the latest TensorFlow framework in a preliminary study, indicating the significance of our guidelines.
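The listing does not reproduce TenFuzz's implementation. As a purely illustrative sketch of the kind of testing such guidelines motivate, the snippet below differentially fuzzes one TensorFlow operator against a NumPy reference, flagging crashes and numerical inconsistencies; the operator choice, tolerances, and all function names are hypothetical.

```python
# Hypothetical sketch only; not TenFuzz (its implementation is not shown here).
# Differential fuzzing: compare a TensorFlow op against a NumPy reference.
import numpy as np
import tensorflow as tf

def fuzz_reduce_sum(trials=100, seed=0):
    rng = np.random.default_rng(seed)
    for i in range(trials):
        # Random rank (1-3) and dimensions (1-7); small shapes keep runs fast.
        shape = tuple(rng.integers(1, 8, size=rng.integers(1, 4)))
        x = rng.standard_normal(shape).astype(np.float32)
        try:
            got = tf.reduce_sum(tf.constant(x)).numpy()
        except Exception as e:                 # crash symptom
            print(f"trial {i}: crash on shape {shape}: {e}")
            continue
        want = x.sum(dtype=np.float32)
        if not np.isclose(got, want, rtol=1e-4, atol=1e-5):  # inconsistency
            print(f"trial {i}: {got} != {want} on shape {shape}")

if __name__ == "__main__":
    fuzz_reduce_sum()
```

Crashes and numerical mismatches against a reference are typical symptoms such framework-testing tools look for; a real tool would cover many operators, dtypes, and extreme values rather than a single op.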
Related papers
- Mutation-Based Deep Learning Framework Testing Method in JavaScript Environment [16.67312523556796]
We propose a mutation-based JavaScript DL framework testing method named DLJSFuzzer.
DLJSFuzzer successfully detects 21 unique crashes and 126 unique NaN & inconsistency bugs.
Compared with all baselines, DLJSFuzzer improves model-generation efficiency by over 47% and bug-detection efficiency by over 91% (a minimal sketch of the mutation idea follows this entry).
arXiv Detail & Related papers (2024-09-23T12:37:56Z)
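As a toy illustration of the mutation idea above (DLJSFuzzer targets TensorFlow.js, and its actual mutation operators are not reproduced here), the Python/Keras sketch below randomly swaps an activation layer and checks each mutant for crashes and NaN outputs; the layer pool and names are hypothetical.

```python
# Hypothetical Python/Keras analogue of mutation-based model testing.
import random
import numpy as np
import tensorflow as tf

# Mutation pool: activation layers that can be swapped in.
ACTIVATIONS = [tf.keras.layers.ReLU, tf.keras.layers.ELU, tf.keras.layers.Softmax]

def make_mutant():
    """Build a small model whose activation layer is randomly mutated."""
    act = random.choice(ACTIVATIONS)()   # mutation point
    return tf.keras.Sequential([
        tf.keras.Input(shape=(16,)),
        tf.keras.layers.Dense(8),
        act,
        tf.keras.layers.Dense(1),
    ])

x = np.random.randn(4, 16).astype(np.float32)
for trial in range(5):
    try:
        y = make_mutant()(x).numpy()
    except Exception as e:               # crash symptom
        print(f"trial {trial}: crash: {e}")
        continue
    if np.isnan(y).any():                # NaN symptom
        print(f"trial {trial}: NaN output")
```

Detecting inconsistency bugs would additionally require running the same mutant on a second backend and comparing the outputs.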
- AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, yet they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z)
- CITADEL: Context Similarity Based Deep Learning Framework Bug Finding [36.34154201748415]
Existing deep learning (DL) framework testing tools cover only a limited range of bug types.
We propose Citadel, a method that improves both the efficiency and the effectiveness of bug finding.
arXiv Detail & Related papers (2024-06-18T01:51:16Z)
- DLAP: A Deep Learning Augmented Large Language Model Prompting Framework for Software Vulnerability Detection [12.686480870065827]
This paper contributes DLAP, a framework that combines the strengths of deep learning (DL) models and Large Language Models (LLMs) to achieve exceptional vulnerability detection performance.
Experimental results confirm that DLAP outperforms state-of-the-art prompting frameworks, including role-based prompts, auxiliary information prompts, chain-of-thought prompts, and in-context learning prompts.
arXiv Detail & Related papers (2024-05-02T11:44:52Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [92.85265959892115]
This paper introduces the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction.
Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers.
To efficiently measure the hallucinations produced by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach that evaluates visual instruction tuning much as human experts would.
arXiv Detail & Related papers (2023-06-26T10:26:33Z)
- Benchmarking the Robustness of LiDAR Semantic Segmentation Models [78.6597530416523]
In this paper, we aim to comprehensively analyze the robustness of LiDAR semantic segmentation models under various corruptions.
We propose a new benchmark called SemanticKITTI-C, which features 16 out-of-domain LiDAR corruptions in three groups: adverse weather, measurement noise, and cross-device discrepancy (a toy corruption example follows this entry).
We design a robust LiDAR segmentation model (RLSeg) which greatly boosts the robustness with simple but effective modifications.
arXiv Detail & Related papers (2023-01-03T06:47:31Z)
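As a toy example of one corruption family above (SemanticKITTI-C's actual corruption definitions are not reproduced here), the sketch below applies a measurement-noise style jitter to a synthetic point cloud; the function name and noise level are hypothetical.

```python
# Hypothetical measurement-noise corruption for an (N, 3) point cloud.
import numpy as np

def jitter_points(points, sigma=0.02, seed=0):
    """Add Gaussian noise to the xyz coordinates of each point."""
    rng = np.random.default_rng(seed)
    return points + rng.normal(0.0, sigma, size=points.shape)

cloud = np.random.rand(1024, 3).astype(np.float32)
corrupted = jitter_points(cloud)
print(np.abs(corrupted - cloud).mean())   # average displacement per coordinate
```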
- MEMO: Coverage-guided Model Generation For Deep Learning Library Testing [11.263121366956726]
A few techniques have been proposed to test deep learning (DL) libraries by generating DL models as test inputs.
However, the test effectiveness of these techniques is constrained by the diversity of the generated DL models.
We propose MEMO to efficiently generate diverse DL models by exploring layer types, layer pairs, and layer parameters (a minimal sketch of this idea follows the entry).
arXiv Detail & Related papers (2022-08-02T14:53:02Z)
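As a toy illustration of the diversity idea above (MEMO's actual coverage criteria and generation algorithm are not reproduced here), the sketch below biases layer selection toward layer pairs that have not yet been exercised; the layer pool, parameter ranges, and names are hypothetical.

```python
# Illustrative only: sketches diversity-driven layer-pair exploration, not MEMO.
import random
import numpy as np
import tensorflow as tf

LAYER_FACTORIES = {
    "dense":   lambda: tf.keras.layers.Dense(random.choice([4, 8, 16])),
    "dropout": lambda: tf.keras.layers.Dropout(random.uniform(0.0, 0.5)),
    "relu":    lambda: tf.keras.layers.ReLU(),
}
seen_pairs = set()  # coverage signal: which layer-type adjacencies were tried

def generate_model(depth=3):
    """Greedily prefer layer pairs that have not been covered yet."""
    names = [random.choice(list(LAYER_FACTORIES))]
    while len(names) < depth:
        candidates = list(LAYER_FACTORIES)
        random.shuffle(candidates)
        nxt = min(candidates, key=lambda c: (names[-1], c) in seen_pairs)
        seen_pairs.add((names[-1], nxt))
        names.append(nxt)
    layers = [LAYER_FACTORIES[n]() for n in names]
    return tf.keras.Sequential([tf.keras.Input(shape=(16,)), *layers]), names

model, names = generate_model()
y = model(np.random.randn(2, 16).astype(np.float32))
print(names, y.shape)
```

Each generated model would then serve as a test input for the library under test, with crashes and output inconsistencies reported as candidate bugs.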
- Towards a Fair Comparison and Realistic Design and Evaluation Framework of Android Malware Detectors [63.75363908696257]
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which explains the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z)
- Challenges in Migrating Imperative Deep Learning Programs to Graph Execution: An Empirical Study [4.415977307120617]
We conduct a data-driven analysis of challenges -- and resultant bugs -- involved in writing reliable yet performant imperative DL code.
We put forth several recommendations, best practices, and anti-patterns for effectively hybridizing imperative DL code (a minimal tf.function example follows this entry).
arXiv Detail & Related papers (2022-01-24T21:12:38Z)
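As a minimal example of the imperative-to-graph hybridization discussed above (the study's concrete recommendations and anti-patterns are not reproduced here), the sketch below wraps an eager-style function in tf.function and notes one widely known pitfall.

```python
# Minimal illustration of hybridizing imperative (eager) DL code with graph
# execution via tf.function; not the paper's recommendations themselves.
import tensorflow as tf

@tf.function  # traces the Python function into a reusable TensorFlow graph
def scaled_sum(x, scale):
    return tf.reduce_sum(x) * scale

x = tf.random.normal([4, 4])
print(scaled_sum(x, tf.constant(2.0)))   # first call traces, then runs the graph
print(scaled_sum(x, tf.constant(3.0)))   # same signature: no retracing needed
# A classic pitfall: Python side effects (e.g., print) inside the traced
# function execute during tracing only, not on every subsequent graph call.
```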
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)