Runtime-Augmented LLMs for Crash Detection and Diagnosis in ML Notebooks
- URL: http://arxiv.org/abs/2602.18537v1
- Date: Fri, 20 Feb 2026 13:19:06 GMT
- Title: Runtime-Augmented LLMs for Crash Detection and Diagnosis in ML Notebooks
- Authors: Yiran Wang, José Antonio Hernández López, Ulf Nilsson, Dániel Varró,
- Abstract summary: We present CRANE-LLM, a novel approach that augments large language models with structured runtime information extracted from the notebook kernel state to detect and diagnose crashes. Given previously executed cells and a target cell, CRANE-LLM combines static code context with runtime information, including object types, tensor shapes, and data attributes, to predict whether the target cell will crash. We evaluate CRANE-LLM on JunoBench, a benchmark of 222 ML notebooks comprising 111 pairs of crashing and corresponding non-crashing notebooks.
- Score: 4.768285672660128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Jupyter notebooks are widely used for machine learning (ML) development due to their support for interactive and iterative experimentation. However, ML notebooks are highly prone to bugs, with crashes being among the most disruptive. Despite their practical importance, systematic methods for crash detection and diagnosis in ML notebooks remain largely unexplored. We present CRANE-LLM, a novel approach that augments large language models (LLMs) with structured runtime information extracted from the notebook kernel state to detect and diagnose crashes before executing a target cell. Given previously executed cells and a target cell, CRANE-LLM combines static code context with runtime information, including object types, tensor shapes, and data attributes, to predict whether the target cell will crash (detection) and explain the underlying cause (diagnosis). We evaluate CRANE-LLM on JunoBench, a benchmark of 222 ML notebooks comprising 111 pairs of crashing and corresponding non-crashing notebooks across multiple ML libraries and crash root causes. Across three state-of-the-art LLMs (Gemini, Qwen, and GPT-5), runtime information improves crash detection and diagnosis by 7-10 percentage points in accuracy and 8-11 points in F1-score, with larger gains for diagnosis. Improvements vary across ML libraries, crash causes, and LLMs, and depend on the integration of complementary categories of runtime information.
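As an illustration of the kind of runtime augmentation the abstract describes, the sketch below gathers object types, tensor shapes, and dtypes from a notebook-like namespace and formats them as extra prompt context. This is a minimal, hypothetical sketch, not CRANE-LLM's actual implementation; the function name `summarize_runtime_state` and the exact summary format are assumptions for illustration only.

```python
# Illustrative sketch (not CRANE-LLM's implementation): collect structured
# runtime facts (types, shapes, dtypes) from a namespace resembling a
# Jupyter kernel's user namespace, for inclusion in an LLM prompt.
import numpy as np

def summarize_runtime_state(namespace):
    """Summarize each variable's type and, where present, shape and dtype."""
    facts = []
    for name, obj in namespace.items():
        if name.startswith("_"):  # skip IPython-style private names
            continue
        entry = f"{name}: type={type(obj).__name__}"
        shape = getattr(obj, "shape", None)  # arrays/tensors expose .shape
        if shape is not None:
            entry += f", shape={tuple(shape)}"
        dtype = getattr(obj, "dtype", None)
        if dtype is not None:
            entry += f", dtype={dtype}"
        facts.append(entry)
    return "\n".join(facts)

# Example kernel state after earlier cells have executed.
state = {"X": np.zeros((100, 5)), "y": np.ones(100), "lr": 0.01}
print(summarize_runtime_state(state))
```

A detector could prepend this summary to the target cell's source code in the prompt, letting the model check, e.g., whether a matrix multiplication's operand shapes are compatible before the cell runs.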
Related papers
- Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All [57.23434868678603]
Live-kBench is an evaluation framework for self-evolving benchmarks that scrapes and evaluates agents on freshly discovered kernel bugs. kEnv is an agent-agnostic crash-resolution environment for kernel compilation, execution, and feedback. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt.
arXiv Detail & Related papers (2026-02-02T19:06:15Z) - InspectCoder: Dynamic Analysis-Enabled Self Repair through interactive LLM-Debugger Collaboration [71.18377595277018]
Large Language Models (LLMs) frequently generate buggy code with complex logic errors that are challenging to diagnose. We present InspectCoder, the first agentic program repair system that empowers LLMs to actively conduct dynamic analysis via interactive debugger control.
arXiv Detail & Related papers (2025-10-21T06:26:29Z) - JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks [4.768285672660128]
We introduce JunoBench, the first benchmark dataset of real-world crashes in Python-based ML notebooks. JunoBench includes 111 curated and reproducible crashes with verified fixes from public Kaggle notebooks.
arXiv Detail & Related papers (2025-10-20T18:46:43Z) - Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks. They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions. Current systems lack a framework that can comprehensively understand agent errors in a modular and systemic way. We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z) - DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models [60.713908578319256]
We propose Direct Discrepancy Learning (DDL) to optimize the detector with task-oriented knowledge. Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance. MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs.
arXiv Detail & Related papers (2025-09-15T10:59:57Z) - When the Code Autopilot Breaks: Why LLMs Falter in Embedded Machine Learning [0.8880611506199766]
We show how prompt format, model behavior, and structural assumptions influence both success rates and failure characteristics. Our analysis reveals a diverse set of error-prone behaviors, including format-induced misinterpretations and runtime-disruptive code that compiles but breaks downstream.
arXiv Detail & Related papers (2025-09-13T19:00:04Z) - CRABS: A syntactic-semantic pincer strategy for bounding LLM interpretation of Python notebooks [8.967739950302407]
Investigating a notebook via re-execution is often impractical due to challenges in resolving data and software ambiguities. We develop a strategy that uses limited syntactic analysis to assist full comprehension of a Python notebook. We evaluate and demonstrate the effectiveness of our approach using an annotated dataset of 50 representative, highly up-voted Kaggle notebooks.
arXiv Detail & Related papers (2025-07-15T21:14:08Z) - LAMeD: LLM-generated Annotations for Memory Leak Detection [5.529919602615033]
We present LAMeD, a novel approach to automatically generate function-specific annotations. When integrated with analyzers such as Cooddy, LAMeD significantly improves memory leak detection and reduces path explosion.
arXiv Detail & Related papers (2025-05-05T05:34:33Z) - Why do Machine Learning Notebooks Crash? An Empirical Study on Public Python Jupyter Notebooks [1.8292110434077904]
We collect 64,031 notebooks containing 92,542 crashes from GitHub and Kaggle. We analyze a sample of 746 crashes across various aspects, including crash types and root causes. We find that over 40% of crashes stem from API misuse and notebook-specific issues.
arXiv Detail & Related papers (2024-11-25T09:33:08Z) - KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z) - PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning [58.85063149619348]
We propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows.
Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets.
arXiv Detail & Related papers (2023-01-25T16:34:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.