Why do Machine Learning Notebooks Crash?
- URL: http://arxiv.org/abs/2411.16795v1
- Date: Mon, 25 Nov 2024 09:33:08 GMT
- Title: Why do Machine Learning Notebooks Crash?
- Authors: Yiran Wang, Willem Meijer, José Antonio Hernández López, Ulf Nilsson, Dániel Varró
- Abstract summary: We collect 64,031 ML notebooks containing 92,542 crashes from GitHub and Kaggle.
We analyze a sample of 746 crashes across various aspects, including exception types and root causes.
Our analysis reveals that 87% of crashes are caused by API misuse, data confusion, notebook-specific issues, environment problems, and implementation errors.
- Score: 1.8292110434077904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Jupyter notebooks have become central in data science, integrating code, text and output in a flexible environment. With the rise of machine learning (ML), notebooks are increasingly used for ML prototyping and data analysis. However, due to their dependence on complex ML libraries and the flexible notebook semantics that allow cells to be run in any order, notebooks are susceptible to software bugs that may lead to program crashes. This paper presents a comprehensive empirical study focused on crashes in ML notebooks. We collect 64,031 ML notebooks containing 92,542 crashes from GitHub and Kaggle, and manually analyze a sample of 746 crashes across various aspects, including exception types and root causes. Our analysis highlights unique root causes related to notebook semantics, including out-of-order execution and previous cell error, that have not been thoroughly covered in earlier research. Furthermore, we categorize crashes as ML bugs or general Python bugs and examine how the crashes are distributed across different stages of the ML pipeline. Our analysis reveals that 87% of crashes are caused by API misuse, data confusion, notebook-specific issues, environment problems, and implementation errors. Crashes are more commonly related to ML bugs than general Python bugs, particularly in Kaggle notebooks, where over 70% of crashes are ML-related. Additionally, most crashes (58%) occur during the data preparation, model training, and evaluation/prediction stages of the ML pipeline, and GitHub and Kaggle exhibit different crash distributions across these stages.
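To make the notebook-specific root causes above concrete, the sketch below shows a minimal hypothetical two-cell notebook (not taken from the study's dataset) that crashes with a NameError when its cells are executed out of order, one of the crash categories the paper attributes to notebook semantics.
```python
# A minimal, hypothetical illustration of an out-of-order-execution crash;
# the cells and variable names are invented for this example.

# --- Cell 1: defines the DataFrame that later cells depend on ---
import pandas as pd

df = pd.DataFrame({"feature": [1.0, 2.0, None], "label": [0, 1, 1]})

# --- Cell 2: works only if Cell 1 has already run in the current kernel ---
# Executed first (or after a kernel restart), it raises:
#   NameError: name 'df' is not defined
# and any cell depending on Cell 2's output then fails as well
# ("previous cell error" in the paper's terminology).
train = df.dropna()
```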
Related papers
- CrashFixer: A crash resolution agent for the Linux kernel [58.152358195983155]
This work builds upon kGym, which shares a benchmark for system-level Linux kernel bugs and a platform to run experiments on the Linux kernel.
This paper introduces CrashFixer, the first LLM-based software repair agent that is applicable to Linux kernel bugs.
arXiv Detail & Related papers (2025-04-29T04:18:51Z)
- TGEA: An error-annotated dataset and benchmark tasks for text generation from pretrained language models [57.758735361535486]
TGEA is an error-annotated dataset for text generation from pretrained language models (PLMs)
We create an error taxonomy to cover 24 types of errors occurring in PLM-generated sentences.
This is the first dataset with comprehensive annotations for PLM-generated texts.
arXiv Detail & Related papers (2025-03-06T09:14:02Z)
- Fault Localization via Fine-tuning Large Language Models with Mutation Generated Stack Traces [3.3158239079459655]
We present a novel approach to localize faults based only on the stack trace information and no additional runtime information.
By fine-tuning on 64,369 crashes resulting from 4.1 million mutations of the code base, we can correctly predict the root cause location of a crash with an accuracy of 66.9%.
arXiv Detail & Related papers (2025-01-29T21:40:32Z)
- Your Fix Is My Exploit: Enabling Comprehensive DL Library API Fuzzing with Large Language Models [49.214291813478695]
Deep learning (DL) libraries, widely used in AI applications, often contain vulnerabilities such as buffer overflows and use-after-free errors.
Traditional fuzzing struggles with the complexity and API diversity of DL libraries.
We propose DFUZZ, an LLM-driven fuzzing approach for DL libraries.
arXiv Detail & Related papers (2025-01-08T07:07:22Z)
- Subgraph-Oriented Testing for Deep Learning Libraries [9.78188667672054]
We propose SORT (Subgraph-Oriented Realistic Testing) to test Deep Learning (DL) libraries on different hardware platforms.
SORT takes popular API interaction patterns, represented as frequent subgraphs of model graphs, as test subjects.
SORT achieves a 100% valid input generation rate, detects more precision bugs than existing methods, and reveals interaction-related bugs missed by single-API testing.
arXiv Detail & Related papers (2024-12-09T12:10:48Z)
- InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems [47.93470713879515]
InternLM2.5-StepProver achieves open-source state-of-the-art on MiniF2F, Lean-Workbook-Plus, ProofNet, and Putnam benchmarks.
It achieves a pass rate of 65.9% on the MiniF2F-test and proves (or disproves) 17.0% of problems in Lean-Workbook-Plus.
arXiv Detail & Related papers (2024-10-21T07:18:23Z)
- STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay [76.06127233986663]
Test-time adaptation (TTA) aims to address the distribution shift between the training and test data with only unlabeled data at test time.
This paper addresses the problem of performing both sample recognition and outlier rejection during inference when outliers exist.
We propose a new approach called STAble Memory rePlay (STAMP), which performs optimization over a stable memory bank instead of the risky mini-batch.
arXiv Detail & Related papers (2024-07-22T16:25:41Z)
- KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z)
- Learning Traffic Crashes as Language: Datasets, Benchmarks, and What-if Causal Analyses [76.59021017301127]
We propose a large-scale traffic crash language dataset, named CrashEvent, summarizing 19,340 real-world crash reports.
We further formulate the crash event feature learning as a novel text reasoning problem and further fine-tune various large language models (LLMs) to predict detailed accident outcomes.
Our experiments results show that our LLM-based approach not only predicts the severity of accidents but also classifies different types of accidents and predicts injury outcomes.
arXiv Detail & Related papers (2024-06-16T03:10:16Z)
- CrashTranslator: Automatically Reproducing Mobile Application Crashes Directly from Stack Trace [30.48737611250448]
This paper proposes an approach named CrashTranslator to automatically reproduce mobile application crashes directly from the stack trace.
We evaluate CrashTranslator on 75 crash reports involving 58 popular Android apps, and it successfully reproduces 61.3% of the crashes.
arXiv Detail & Related papers (2023-10-11T02:00:18Z)
- Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs)
We formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables.
We show that these models achieve close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
- What Causes Exceptions in Machine Learning Applications? Mining Machine Learning-Related Stack Traces on Stack Overflow [6.09414932258309]
We study 11,449 stack traces related to seven popular Python ML libraries on Stack Overflow.
ML questions that contain stack traces gain more popularity than questions without stack traces.
Patterns related to subprocess invocation, external module execution, and remote API call are among the least likely to get accepted answers.
arXiv Detail & Related papers (2023-04-25T14:29:07Z)
- OmniXAI: A Library for Explainable AI [98.07381528393245]
We introduce OmniXAI, an open-source Python library of eXplainable AI (XAI)
It offers omni-way explainable AI capabilities and various interpretable machine learning techniques.
For practitioners, the library provides an easy-to-use unified interface to generate the explanations for their applications.
arXiv Detail & Related papers (2022-06-01T11:35:37Z)
- Pynblint: a Static Analyzer for Python Jupyter Notebooks [10.190501703364234]
Pynblint is a static analyzer for Jupyter notebooks written in Python.
It checks compliance of notebooks (and surrounding repositories) with a set of empirically validated best practices.
arXiv Detail & Related papers (2022-05-24T09:56:03Z)
- Large-scale Crash Localization using Multi-Task Learning [3.4383679424643456]
We develop a novel multi-task sequence labeling approach for identifying blamed frames in stack traces.
We evaluate our model with over a million real-world crashes from four popular Microsoft applications.
arXiv Detail & Related papers (2021-09-29T10:26:57Z)
- S3M: Siamese Stack (Trace) Similarity Measure [55.58269472099399]
We present S3M -- the first approach to computing stack trace similarity based on deep learning.
It is based on a biLSTM encoder and a fully-connected classifier to compute similarity.
Our experiments demonstrate the superiority of our approach over the state-of-the-art on both open-sourced data and a private JetBrains dataset.
arXiv Detail & Related papers (2021-03-18T21:10:41Z)
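Since the entry above only names the architecture, here is a minimal sketch of a Siamese biLSTM encoder with a fully-connected similarity head in the spirit of S3M; it assumes PyTorch, and the tokenization, layer sizes, and classifier head are illustrative assumptions rather than the authors' implementation.
```python
# Minimal Siamese biLSTM similarity sketch (assumed PyTorch; sizes are illustrative).
import torch
import torch.nn as nn


class StackTraceEncoder(nn.Module):
    """Encodes a stack trace (sequence of frame-token ids) into a fixed vector."""

    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, seq_len) integer ids of stack-frame tokens
        _, (h_n, _) = self.bilstm(self.embed(frames))
        # concatenate the final forward and backward hidden states -> (batch, 2 * hidden_dim)
        return torch.cat([h_n[0], h_n[1]], dim=-1)


class SiameseSimilarity(nn.Module):
    """Shared encoder for both traces plus a fully-connected head producing a score."""

    def __init__(self, vocab_size: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = StackTraceEncoder(vocab_size, hidden_dim=hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, trace_a: torch.Tensor, trace_b: torch.Tensor) -> torch.Tensor:
        ea, eb = self.encoder(trace_a), self.encoder(trace_b)
        return torch.sigmoid(self.classifier(torch.cat([ea, eb], dim=-1)))


# Usage with random token ids standing in for two batches of stack traces.
model = SiameseSimilarity(vocab_size=1000)
a = torch.randint(1, 1000, (4, 30))
b = torch.randint(1, 1000, (4, 30))
scores = model(a, b)  # (4, 1) similarity scores in [0, 1]
```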