JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks
- URL: http://arxiv.org/abs/2510.18013v3
- Date: Mon, 10 Nov 2025 13:52:00 GMT
- Title: JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks
- Authors: Yiran Wang, José Antonio Hernández López, Ulf Nilsson, Dániel Varró
- Abstract summary: We introduce JunoBench, the first benchmark dataset of real-world crashes in Python-based ML notebooks. JunoBench includes 111 curated and reproducible crashes with verified fixes from public Kaggle notebooks.
- Score: 4.768285672660128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Jupyter notebooks are widely used for machine learning (ML) prototyping. Yet, few debugging tools are designed for ML code in notebooks, partly due to the lack of benchmarks. We introduce JunoBench, the first benchmark dataset of real-world crashes in Python-based ML notebooks. JunoBench includes 111 curated and reproducible crashes with verified fixes from public Kaggle notebooks, covering popular ML libraries (e.g., TensorFlow/Keras, PyTorch, Scikit-learn) and notebook-specific out-of-order execution errors. JunoBench ensures reproducibility and ease of use through a unified environment that reliably reproduces all crashes. By providing realistic crashes, their resolutions, richly annotated labels of crash characteristics, and natural-language diagnostic annotations, JunoBench facilitates research on bug detection, localization, diagnosis, and repair in notebook-based ML development.
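The out-of-order execution errors mentioned above are unique to notebooks: a Jupyter kernel keeps one shared namespace across cells, so running cells in the wrong order can crash on state that was never created. A minimal sketch (with hypothetical cells, not drawn from the JunoBench dataset):

```python
# A Jupyter kernel keeps one shared namespace; each "cell" mutates it.
# We model that here with a dict and exec(), purely for illustration.
ns = {}

def cell_1(ns):
    exec("X = [1, 2, 3]", ns)        # defines X in the shared namespace

def cell_2(ns):
    exec("y = sum(X) / len(X)", ns)  # reads X, assuming cell_1 already ran

# Executing cell_2 first reproduces the notebook-specific crash:
# X is not yet defined in the kernel namespace.
try:
    cell_2(ns)
except NameError as e:
    print("crash:", e)

# Running the cells in their intended order succeeds.
cell_1(ns)
cell_2(ns)
print("y =", ns["y"])
```

Crashes of this kind disappear when the notebook is restarted and run top-to-bottom, which is why a benchmark must pin down the exact execution order that reproduces them.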
Related papers
- Runtime-Augmented LLMs for Crash Detection and Diagnosis in ML Notebooks [4.768285672660128]
We present CRANE-LLM, a novel approach that augments large language models with structured runtime information extracted from the notebook kernel state to detect and diagnose crashes. Given previously executed cells and a target cell, CRANE-LLM combines static code context with runtime information, including object types, tensor shapes, and data attributes, to predict whether the target cell will crash. We evaluate CRANE-LLM on JunoBench, a benchmark of 222 ML notebooks comprising 111 pairs of crashing and corresponding non-crashing notebooks.
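The runtime information described above (object types, tensor shapes, data attributes) can be collected from a live kernel namespace. A hedged sketch of that idea follows; the function name and output format are assumptions for illustration, not the paper's actual implementation:

```python
# Summarize a kernel namespace into the kind of runtime facts an LLM prompt
# could include: variable names, types, and array/tensor shapes.
def summarize_kernel_state(ns):
    facts = []
    for name, obj in sorted(ns.items()):
        if name.startswith("_"):
            continue  # skip private/bookkeeping variables
        entry = f"{name}: {type(obj).__name__}"
        shape = getattr(obj, "shape", None)  # numpy arrays and torch tensors expose .shape
        if shape is not None:
            entry += f", shape={tuple(shape)}"
        facts.append(entry)
    return facts

# Plain Python objects keep the sketch dependency-free:
ns = {"lr": 0.01, "labels": [0, 1, 1], "_hidden": object()}
print(summarize_kernel_state(ns))  # → ['labels: list', 'lr: float']
```

In a real notebook, the namespace would come from the running kernel (e.g., IPython's interactive namespace) rather than a hand-built dict.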
arXiv Detail & Related papers (2026-02-20T13:19:06Z) - Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All [57.23434868678603]
Live-kBench is an evaluation framework for self-evolving benchmarks that scrapes and evaluates agents on freshly discovered kernel bugs. kEnv is an agent-agnostic crash-resolution environment for kernel compilation, execution, and feedback. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt.
arXiv Detail & Related papers (2026-02-02T19:06:15Z) - A Regression Testing Framework with Automated Assertion Generation for Machine Learning Notebooks [2.5834567990387565]
We introduce NBTest - the first regression testing framework that allows developers to write cell-level assertions in notebooks. NBTest offers a library of assertion APIs and a JupyterLab plugin that enables executing assertions. We evaluate NBTest on 592 Kaggle notebooks.
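A cell-level assertion, in the spirit of the description above, guards a metric a cell just computed. The helper name and tolerance API below are invented for illustration and are not NBTest's actual interface:

```python
# Hypothetical cell-level assertion: fail the cell if a metric drifts
# beyond a tolerance, so regressions surface on re-execution.
def assert_metric_close(actual, expected, tol):
    assert abs(actual - expected) <= tol, (
        f"metric {actual} outside {expected} +/- {tol}"
    )

accuracy = 0.93  # value the preceding "training" cell produced
assert_metric_close(accuracy, 0.95, tol=0.05)
print("cell-level assertion passed")
```

Because ML metrics are stochastic, tolerance-based checks rather than exact equality are the natural assertion style for notebook regression tests.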
arXiv Detail & Related papers (2025-09-17T03:05:16Z) - CrashFixer: A crash resolution agent for the Linux kernel [58.152358195983155]
This work builds upon kGym, which shares a benchmark for system-level Linux kernel bugs and a platform to run experiments on the Linux kernel. This paper introduces CrashFixer, the first LLM-based software repair agent that is applicable to Linux kernel bugs.
arXiv Detail & Related papers (2025-04-29T04:18:51Z) - Why do Machine Learning Notebooks Crash? An Empirical Study on Public Python Jupyter Notebooks [1.8292110434077904]
We collect 64,031 notebooks containing 92,542 crashes from GitHub and Kaggle. We analyze a sample of 746 crashes across various aspects, including crash types and root causes. We find that over 40% of crashes stem from API misuse and notebook-specific issues.
arXiv Detail & Related papers (2024-11-25T09:33:08Z) - STAMP: Outlier-Aware Test-Time Adaptation with Stable Memory Replay [76.06127233986663]
Test-time adaptation (TTA) aims to address the distribution shift between the training and test data with only unlabeled data at test time.
This paper pays attention to the problem that conducts both sample recognition and outlier rejection during inference while outliers exist.
We propose a new approach called STAble Memory rePlay (STAMP), which performs optimization over a stable memory bank instead of the risky mini-batch.
arXiv Detail & Related papers (2024-07-22T16:25:41Z) - KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z) - DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - Pynblint: a Static Analyzer for Python Jupyter Notebooks [10.190501703364234]
Pynblint is a static analyzer for Jupyter notebooks written in Python.
It checks compliance of notebooks (and surrounding repositories) with a set of empirically validated best practices.
arXiv Detail & Related papers (2022-05-24T09:56:03Z) - ReproduceMeGit: A Visualization Tool for Analyzing Reproducibility of Jupyter Notebooks [0.0]
We present ReproduceMeGit, a visualization tool for analyzing the reproducibility of Jupyter notebooks hosted on GitHub.
The tool provides information on the number of notebooks that were successfully reproducible, those that resulted in exceptions, those with different results from the original notebooks, etc.
arXiv Detail & Related papers (2020-06-22T10:05:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences of its use.