Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and
TensorFlow
- URL: http://arxiv.org/abs/2112.13314v2
- Date: Fri, 1 Sep 2023 18:01:40 GMT
- Title: Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and
TensorFlow
- Authors: Florian Tambon, Amin Nikanjam, Le An, Foutse Khomh, Giuliano Antoniol
- Abstract summary: Deep Learning (DL) frameworks are now widely used, simplifying the creation of complex models and their integration into various applications, even for non-DL experts.
This paper deals with the subcategory of bugs named silent bugs: they lead to wrong behavior but do not cause system crashes or hangs, nor show an error message to the user.
This paper presents the first empirical study of Keras and TensorFlow silent bugs and their impact on users' programs.
- Score: 13.260758930014154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Learning (DL) frameworks are now widely used, simplifying the creation
of complex models as well as their integration into various applications, even for
non-DL experts. However, like any other software, they are prone to bugs. This
paper deals with the subcategory of bugs named silent bugs: they lead to wrong
behavior but do not cause system crashes or hangs, nor show an error
message to the user. Such bugs are even more dangerous in DL applications and
frameworks due to the "black-box" and stochastic nature of these systems (the end
user cannot understand how the model makes decisions). This paper presents the
first empirical study of Keras and TensorFlow silent bugs, and their impact on
users' programs. We extracted closed issues related to Keras from the
TensorFlow GitHub repository. Out of the 1,168 issues that we gathered, 77 were
reproducible silent bugs affecting users' programs. We categorized the bugs
based on the effects on the users' programs and the components where the issues
occurred, using information from the issue reports. We then derived a threat
level for each of the issues, based on the impact they had on the users'
programs. To assess the relevance of identified categories and the impact
scale, we conducted an online survey with 103 DL developers. The participants
generally agreed with the significant impact of silent bugs in DL libraries and
acknowledged our findings (i.e., categories of silent bugs and the proposed
impact scale). Finally, leveraging our analysis, we provide a set of guidelines
to facilitate safeguarding against such bugs in DL frameworks.
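To make the failure mode concrete, here is a minimal, hypothetical sketch of a silent bug pattern. It is not taken from the paper's dataset, and the function name and numbers are invented for illustration: the code runs to completion and produces a plausible-looking metric, but a hard-coded divisor silently skews the result on a partially filled batch, with no crash and no error message.

```python
# Hypothetical illustration of a "silent bug": the program produces a
# wrong but plausible number, with no crash, hang, or error message --
# exactly the failure mode the paper studies.

def batch_accuracy(predictions, labels):
    """Intended: fraction of correct predictions in the batch."""
    correct = 0
    for pred, label in zip(predictions, labels):
        correct += int(pred == label)
    # BUG: the correct divisor is len(predictions); using a hard-coded
    # batch size silently deflates the metric whenever the final batch
    # is smaller than 32.
    return correct / 32

preds = [1, 0, 1, 1]   # a final, partially filled batch of 4 items
labels = [1, 0, 1, 0]
print(batch_accuracy(preds, labels))  # 0.09375 instead of the intended 0.75
```

Because the returned value is still a valid-looking accuracy between 0 and 1, nothing in the program signals that anything went wrong, which is why such bugs are hard to detect without explicit checks.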
Related papers
- The Impact Of Bug Localization Based on Crash Report Mining: A Developers' Perspective [7.952391285456257]
We report our experience of using an approach for grouping crash reports and finding buggy code on a weekly basis for 18 months.
The approach investigated in this study correctly suggested the buggy file most of the time -- the approach's precision was around 80%.
arXiv Detail & Related papers (2024-03-16T01:23:01Z) - Towards Understanding the Challenges of Bug Localization in Deep
Learning Systems [2.9312156642007294]
We conduct a large-scale empirical study to better understand the challenges of localizing bugs in deep-learning systems.
First, we determine the bug localization performance of four existing techniques using 2,365 bugs from deep-learning systems and 2,913 from traditional software.
Second, we evaluate how different bug types in deep learning systems impact bug localization.
arXiv Detail & Related papers (2024-02-01T21:17:42Z) - DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z) - A Comprehensive Empirical Study of Bugs in Open-Source Federated
Learning Frameworks [11.835104059182832]
Federated learning (FL) is a distributed machine learning (ML) paradigm that allows multiple clients to collaboratively train ML models without exposing clients' private data.
To foster the application of FL, a variety of FL frameworks have been proposed, allowing non-experts to easily train ML models.
We conduct the first empirical study to comprehensively collect, taxonomize, and characterize bugs in FL frameworks.
arXiv Detail & Related papers (2023-08-09T15:14:16Z) - An Empirical Study on Bugs Inside PyTorch: A Replication Study [10.848682558737494]
We characterize bugs in the PyTorch library, a very popular deep learning framework.
Our results highlight that PyTorch bugs are more similar to bugs in traditional software projects than to bugs specific to deep learning.
arXiv Detail & Related papers (2023-07-25T19:23:55Z) - What Happens When We Fuzz? Investigating OSS-Fuzz Bug History [0.9772968596463595]
We analyzed 44,102 reported issues made public by OSS-Fuzz prior to March 12, 2022.
We identified the bug-contributing commits to estimate when the bug-containing code was introduced, and measured the timeline from introduction to detection to fix.
arXiv Detail & Related papers (2023-05-19T05:15:36Z) - Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers.
We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z) - DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation [61.99379022383108]
We propose new deep learning models to solve the bug triage problem.
The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network.
To improve the quality of ranking, we propose using additional information from version control system annotations.
arXiv Detail & Related papers (2022-01-14T00:16:57Z) - Indiscriminate Poisoning Attacks Are Shortcuts [77.38947817228656]
We find that the perturbations of advanced poisoning attacks are almost linearly separable when assigned the target labels of the corresponding samples.
We show that such synthetic perturbations are as powerful as the deliberately crafted attacks.
Our finding suggests that the shortcut learning problem is more serious than previously believed.
arXiv Detail & Related papers (2021-11-01T12:44:26Z) - Scarecrow: A Framework for Scrutinizing Machine Text [69.26985439191151]
We introduce a new structured, crowdsourced error annotation schema called Scarecrow.
Scarecrow collects 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text.
These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems.
arXiv Detail & Related papers (2021-07-02T22:37:03Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation
Extraction Task [80.38130122127882]
TACRED is one of the largest and most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.