Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and
TensorFlow
- URL: http://arxiv.org/abs/2112.13314v2
- Date: Fri, 1 Sep 2023 18:01:40 GMT
- Title: Silent Bugs in Deep Learning Frameworks: An Empirical Study of Keras and
TensorFlow
- Authors: Florian Tambon, Amin Nikanjam, Le An, Foutse Khomh, Giuliano Antoniol
- Abstract summary: Deep Learning (DL) frameworks are now widely used, simplifying the creation of complex models and their integration into various applications, even for non-DL experts.
This paper deals with the subcategory of bugs known as silent bugs: they lead to wrong behavior but do not cause system crashes or hangs, nor show an error message to the user.
This paper presents the first empirical study of Keras and TensorFlow silent bugs and their impact on users' programs.
- Score: 13.260758930014154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Learning (DL) frameworks are now widely used, simplifying the creation
of complex models and their integration into various applications, even for
non-DL experts. However, like any other software, they are prone to bugs. This
paper deals with the subcategory of bugs known as silent bugs: they lead to wrong
behavior but do not cause system crashes or hangs, nor show an error
message to the user. Such bugs are even more dangerous in DL applications and
frameworks due to the "black-box" and stochastic nature of these systems (the end
user cannot understand how the model makes decisions). This paper presents the
first empirical study of Keras and TensorFlow silent bugs, and their impact on
users' programs. We extracted closed issues related to Keras from the
TensorFlow GitHub repository. Out of the 1,168 issues that we gathered, 77 were
reproducible silent bugs affecting users' programs. We categorized the bugs
based on the effects on the users' programs and the components where the issues
occurred, using information from the issue reports. We then derived a threat
level for each of the issues, based on the impact they had on the users'
programs. To assess the relevance of identified categories and the impact
scale, we conducted an online survey with 103 DL developers. The participants
generally agreed with the significant impact of silent bugs in DL libraries and
acknowledged our findings (i.e., categories of silent bugs and the proposed
impact scale). Finally, leveraging our analysis, we provide a set of guidelines
to facilitate safeguarding against such bugs in DL frameworks.
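To make the notion of a silent bug concrete, the following is a minimal, hypothetical Keras/TensorFlow sketch (illustrative only, not drawn from the 77 issues studied in the paper): the loss is configured for probabilities while the model emits raw logits. Training completes without any crash, hang, or error message, yet it optimizes a mismatched objective, so model quality silently degrades.

    # Hypothetical illustration of a silent-bug pattern (not from the paper's dataset).
    import numpy as np
    import tensorflow as tf

    x = np.random.rand(256, 20).astype("float32")
    y = np.random.randint(0, 3, size=(256,))

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(3),  # raw logits: no softmax activation
    ])

    # Silent mismatch: from_logits defaults to False, so the loss treats the raw
    # logits as probabilities. No exception or error message is raised, but the
    # objective being optimized does not match the model's outputs.
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
        metrics=["accuracy"],
    )
    model.fit(x, y, epochs=1, verbose=0)

    # The intended configuration would declare the outputs as logits:
    # loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

Because nothing fails loudly, such a mistake typically surfaces only as unexplained poor accuracy, which matches the paper's point that the "black-box" and stochastic nature of DL systems makes silent bugs hard to notice.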
Related papers
- "Silent Is Not Actually Silent": An Investigation of Toxicity on Bug Report Discussion [0.0]
This study explores toxicity in GitHub bug reports through a qualitative analysis of 203 bug threads, including 81 toxic ones.
Our findings reveal that toxicity frequently arises from misaligned perceptions of bug severity and priority, unresolved frustrations with tools, and lapses in professional communication.
Our preliminary findings offer actionable recommendations to improve bug resolution by mitigating toxicity.
arXiv Detail & Related papers (2025-03-13T05:39:29Z) - Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695]
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities such as buffer overflows and use-after-free errors.
Traditional fuzzing struggles with the complexity and API diversity of DL libraries.
We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
arXiv Detail & Related papers (2025-01-08T07:07:22Z) - Leveraging Data Characteristics for Bug Localization in Deep Learning Programs [21.563130049562357]
We propose Theia, which detects and localizes structural bugs in Deep Learning (DL) programs.
Our results show that Theia successfully localizes 57/75 structural bugs in 40 buggy programs, whereas NeuraLint, a state-of-the-art approach capable of localizing structural bugs before training, localizes 17/75 bugs.
arXiv Detail & Related papers (2024-12-08T01:52:06Z) - CITADEL: Context Similarity Based Deep Learning Framework Bug Finding [36.34154201748415]
Existing deep learning (DL) framework testing tools have limited coverage of bug types.
We propose Citadel, a method that finds bugs more efficiently and effectively.
arXiv Detail & Related papers (2024-06-18T01:51:16Z) - Leveraging Large Language Models for Efficient Failure Analysis in Game Development [47.618236610219554]
This paper proposes a new approach to automatically identify which change in the code caused a test to fail.
The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure.
Our approach reaches an accuracy of 71% in our newly created dataset, which comprises issues reported by developers at EA over a period of one year.
arXiv Detail & Related papers (2024-06-11T09:21:50Z) - The Impact Of Bug Localization Based on Crash Report Mining: A Developers' Perspective [7.952391285456257]
We report our experience of using an approach for grouping crash reports and finding buggy code on a weekly basis for 18 months.
The approach investigated in this study correctly suggested the buggy file most of the time -- the approach's precision was around 80%.
arXiv Detail & Related papers (2024-03-16T01:23:01Z) - Towards Understanding the Challenges of Bug Localization in Deep
Learning Systems [2.9312156642007294]
We conduct a large-scale empirical study to better understand the challenges of localizing bugs in deep-learning systems.
First, we determine the bug localization performance of four existing techniques using 2,365 bugs from deep-learning systems and 2,913 from traditional software.
Second, we evaluate how different bug types in deep learning systems impact bug localization.
arXiv Detail & Related papers (2024-02-01T21:17:42Z) - DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z) - A Comprehensive Empirical Study of Bugs in Open-Source Federated
Learning Frameworks [11.835104059182832]
Federated learning (FL) is a distributed machine learning (ML) paradigm that allows multiple clients to collaboratively train ML models without exposing their private data.
To foster the application of FL, a variety of FL frameworks have been proposed, allowing non-experts to easily train ML models.
We conduct the first empirical study to comprehensively collect, taxonomize, and characterize bugs in FL frameworks.
arXiv Detail & Related papers (2023-08-09T15:14:16Z) - An Empirical Study on Bugs Inside PyTorch: A Replication Study [10.848682558737494]
We characterize bugs in the PyTorch library, a very popular deep learning framework.
Our results highlight that PyTorch bugs resemble bugs in traditional software projects more than bugs tied to deep learning characteristics.
arXiv Detail & Related papers (2023-07-25T19:23:55Z) - Using Developer Discussions to Guide Fixing Bugs in Software [51.00904399653609]
We propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for additional information from developers.
We demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
arXiv Detail & Related papers (2022-11-11T16:37:33Z) - DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation [61.99379022383108]
We propose new deep learning models to solve the bug triage problem.
The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network.
To improve the quality of ranking, we propose using additional information from version control system annotations.
arXiv Detail & Related papers (2022-01-14T00:16:57Z) - Indiscriminate Poisoning Attacks Are Shortcuts [77.38947817228656]
We find that the perturbations of advanced poisoning attacks are almost linearly separable when assigned the target labels of the corresponding samples.
We show that such synthetic perturbations are as powerful as the deliberately crafted attacks.
Our finding suggests that the shortcut learning problem is more serious than previously believed.
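The separability claim above can be probed with a simple check; the sketch below (synthetic stand-in data, not the paper's experiments, all names hypothetical) fits a linear classifier on (perturbation, target label) pairs and reports its training accuracy, which is the sense in which the perturbations act as linearly separable shortcut features.

    # Hypothetical probe: are the "perturbations" linearly separable under target labels?
    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    n, d, num_classes = 1000, 64, 10

    # Synthetic stand-in for poisoning perturbations: a class-dependent direction
    # plus small noise, mimicking shortcut-like structure.
    target_labels = rng.integers(0, num_classes, size=n)
    class_directions = rng.normal(size=(num_classes, d))
    perturbations = class_directions[target_labels] + 0.1 * rng.normal(size=(n, d))

    # Near-perfect training accuracy of a linear model means the perturbations are
    # (almost) linearly separable given the assigned target labels.
    clf = LinearSVC(max_iter=10000).fit(perturbations, target_labels)
    print("linear training accuracy:", clf.score(perturbations, target_labels))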
arXiv Detail & Related papers (2021-11-01T12:44:26Z) - Scarecrow: A Framework for Scrutinizing Machine Text [69.26985439191151]
We introduce a new structured, crowdsourced error annotation schema called Scarecrow.
Scarecrow collects 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text.
These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems.
arXiv Detail & Related papers (2021-07-02T22:37:03Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation
Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)