DaiFu: In-Situ Crash Recovery for Deep Learning Systems
- URL: http://arxiv.org/abs/2507.01628v1
- Date: Wed, 02 Jul 2025 11:58:38 GMT
- Title: DaiFu: In-Situ Crash Recovery for Deep Learning Systems
- Authors: Zilong He, Pengfei Chen, Hongyu Zhang, Xiaoyun Li, Guangba Yu, Hongyang Chen, Zibin Zheng
- Abstract summary: We present DaiFu, an in-situ recovery framework for deep learning (DL) systems. Through a lightweight code transformation, DaiFu augments a given DL system to intercept crashes in situ and enables dynamic and instant updates to its running program context. Our evaluation shows that DaiFu reduces the restore time for crash recovery, achieving a 1372x speedup compared with state-of-the-art solutions.
- Score: 54.52831889359226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning (DL) systems have been widely adopted in many areas, and are becoming even more popular with the emergence of large language models. However, due to the complex software stacks involved in their development and execution, crashes are unavoidable and common. Crashes severely waste computing resources and hinder development productivity, so efficient crash recovery is crucial. Existing solutions, such as checkpoint-retry, are too heavyweight for fast recovery from crashes caused by minor programming errors or transient runtime errors. Therefore, we present DaiFu, an in-situ recovery framework for DL systems. Through a lightweight code transformation to a given DL system, DaiFu augments it to intercept crashes in situ and enables dynamic and instant updates to its program running context (e.g., code, configurations, and other data) for agile crash recovery. Our evaluation shows that DaiFu helps reduce the restore time for crash recovery, achieving a 1372x speedup compared with state-of-the-art solutions. Meanwhile, the overhead of DaiFu is negligible (under 0.40%). We also construct a benchmark spanning 7 distinct crash scenarios in DL systems, and show the effectiveness of DaiFu in diverse situations.
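The abstract describes DaiFu's mechanism only at a high level. As a minimal, hypothetical Python sketch of the general idea (not DaiFu's actual API), in-situ recovery can be pictured as catching a crash around the training step, keeping the live in-memory state, and letting the caller hot-swap the step function or configuration before execution resumes:

```python
# Hypothetical sketch of in-situ crash recovery (illustrative; not DaiFu's real API).
# On a crash, the live in-memory state (model, optimizer, current batch) is kept,
# the caller supplies a patched step function and/or configuration, and the loop
# resumes in place instead of restarting the whole process from a checkpoint.
import traceback


def run_with_in_situ_recovery(step_fn, state, config, max_retries=3):
    """Run step_fn(state, config); on an exception, hot-swap the step function
    or configuration and retry with the same in-memory state."""
    for attempt in range(max_retries + 1):
        try:
            return step_fn(state, config)
        except Exception:
            traceback.print_exc()
            if attempt == max_retries:
                raise  # give up and fall back to conventional checkpoint-retry
            # Hypothetical hook: obtain an updated step function and/or config,
            # e.g. code reloaded from an edited source file or a tweaked setting.
            step_fn, config = request_patched_context(step_fn, config)


def request_patched_context(step_fn, config):
    # Placeholder patch policy, purely illustrative: halve the batch size,
    # a common manual fix after an out-of-memory crash.
    config = dict(config, batch_size=max(1, config.get("batch_size", 32) // 2))
    return step_fn, config
```

A conventional checkpoint-retry loop would instead tear down the process and reload the last checkpoint, which is why it is heavyweight for minor programming errors or transient runtime errors.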
Related papers
- Scalable and Accurate Application-Level Crash-Consistency Testing via Representative Testing [4.659174681934402]
We build Pathfinder, a crash-consistency testing tool that implements an update-behavior-based approach to approximate a small set of representative crash states. Pathfinder scales more effectively to large applications than prior works, finding 4x more bugs in POSIX-based applications and 8x more bugs in MMIO-based applications.
arXiv Detail & Related papers (2025-03-03T10:41:57Z) - Fault Localization via Fine-tuning Large Language Models with Mutation Generated Stack Traces [3.3158239079459655]
We present a novel approach that localizes faults based only on stack trace information, with no additional runtime information. By fine-tuning on 64,369 crashes resulting from 4.1 million mutations of the code base, we can correctly predict the root cause location of a crash with an accuracy of 66.9%.
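As an illustration only (the paper's actual data pipeline is not shown in this summary), a mutation-induced crash can be turned into a fine-tuning example by pairing the stack trace with the mutated location as the root-cause target; the field names and prompt format below are assumptions:

```python
# Hypothetical sketch of building one fine-tuning example from a mutation-induced
# crash (field names and prompt format are assumptions, not the paper's pipeline):
# the stack trace is the input and the mutated location is the root-cause target.
def make_training_example(stack_trace: str, mutated_file: str, mutated_line: int) -> dict:
    return {
        "prompt": f"Stack trace:\n{stack_trace}\nWhere is the root cause?",
        "target": f"{mutated_file}:{mutated_line}",
    }


example = make_training_example(
    stack_trace="ZeroDivisionError\n  at compute_ratio (metrics.py:42)",
    mutated_file="metrics.py",
    mutated_line=42,
)
```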
arXiv Detail & Related papers (2025-01-29T21:40:32Z) - DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization [59.96455188197593]
Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs. We propose DRPruning, a method that dynamically adjusts the data distribution during training to restore balanced performance across heterogeneous and multi-tasking data. Experiments in monolingual and multilingual settings show that DRPruning surpasses similarly sized models in both pruning and continued pretraining over perplexity, downstream tasks, and instruction tuning.
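As a rough illustration of distributionally robust data reweighting in general (not DRPruning's specific procedure), domains whose loss exceeds a reference can be upweighted with an exponentiated-gradient step:

```python
# Generic distributionally robust reweighting sketch (not DRPruning's exact algorithm):
# domains whose loss exceeds a reference get a larger share of the training data.
import numpy as np


def update_domain_weights(weights, domain_losses, reference_losses, step_size=0.1):
    """Exponentiated-gradient update on per-domain sampling weights."""
    excess = np.asarray(domain_losses) - np.asarray(reference_losses)
    new_weights = np.asarray(weights) * np.exp(step_size * excess)
    return new_weights / new_weights.sum()


# Example: the second domain lags behind its reference loss, so its weight grows.
w = update_domain_weights([0.5, 0.5], domain_losses=[1.0, 2.0], reference_losses=[1.0, 1.2])
```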
arXiv Detail & Related papers (2024-11-21T12:02:39Z) - KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [59.20933707301566]
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks.
In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel.
To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym and kBench.
arXiv Detail & Related papers (2024-07-02T21:44:22Z) - Learning Traffic Crashes as Language: Datasets, Benchmarks, and What-if Causal Analyses [76.59021017301127]
We propose a large-scale traffic crash language dataset, named CrashEvent, summarizing 19,340 real-world crash reports.
We formulate crash event feature learning as a novel text reasoning problem and fine-tune various large language models (LLMs) to predict detailed accident outcomes.
Our experimental results show that our LLM-based approach not only predicts the severity of accidents but also classifies different types of accidents and predicts injury outcomes.
arXiv Detail & Related papers (2024-06-16T03:10:16Z) - Crash Report Accumulation During Continuous Fuzzing [0.0]
We propose a crash accumulation method and implement it as part of the CASR toolset.
We evaluate our approach on crash reports collected from fuzzing results.
arXiv Detail & Related papers (2024-05-28T13:36:31Z) - CrashTranslator: Automatically Reproducing Mobile Application Crashes Directly from Stack Trace [30.48737611250448]
This paper proposes an approach named CrashTranslator to automatically reproduce mobile application crashes directly from the stack trace.
We evaluate CrashTranslator on 75 crash reports involving 58 popular Android apps, and it successfully reproduces 61.3% of the crashes.
arXiv Detail & Related papers (2023-10-11T02:00:18Z) - Large-scale Crash Localization using Multi-Task Learning [3.4383679424643456]
We develop a novel multi-task sequence labeling approach for identifying blamed frames in stack traces.
We evaluate our model with over a million real-world crashes from four popular Microsoft applications.
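A generic multi-task sequence-labeling setup over stack-trace frames might look like the sketch below (illustrative only; the paper's actual architecture and auxiliary task are assumptions here): a shared encoder reads the frames, one head tags each frame as blamed or not, and another predicts a trace-level label.

```python
# Illustrative multi-task sequence labeler over stack-trace frames (assumed design,
# not the paper's architecture): a shared bidirectional encoder feeds a per-frame
# "blamed or not" head and a trace-level classification head.
import torch.nn as nn


class StackTraceTagger(nn.Module):
    def __init__(self, vocab_size, dim=128, num_trace_labels=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)               # one token id per frame
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.blame_head = nn.Linear(2 * dim, 2)                  # per-frame: blamed / not
        self.trace_head = nn.Linear(2 * dim, num_trace_labels)   # whole-trace label

    def forward(self, frame_ids):
        # frame_ids: (batch, num_frames)
        hidden, _ = self.encoder(self.embed(frame_ids))          # (batch, num_frames, 2*dim)
        blame_logits = self.blame_head(hidden)                   # (batch, num_frames, 2)
        trace_logits = self.trace_head(hidden.mean(dim=1))       # (batch, num_trace_labels)
        return blame_logits, trace_logits
```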
arXiv Detail & Related papers (2021-09-29T10:26:57Z) - Always Be Dreaming: A New Approach for Data-Free Class-Incremental Learning [73.24988226158497]
We consider the high-impact problem of Data-Free Class-Incremental Learning (DFCIL).
We propose a novel incremental distillation strategy for DFCIL, contributing a modified cross-entropy training and importance-weighted feature distillation.
Our method results in up to a 25.1% increase in final task accuracy (absolute difference) compared to SOTA DFCIL methods for common class-incremental benchmarks.
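The loss terms are named but not defined in this summary; one plausible form (an assumption, not the paper's exact formulation) combines cross-entropy restricted to the newly added classes with an importance-weighted feature-distillation penalty against the frozen old model, where the importance weights and class mask are assumed inputs:

```python
# Assumed form of the combined objective (not the paper's exact definition):
# cross-entropy restricted to newly added classes plus an importance-weighted
# L2 feature-distillation penalty against the frozen old model's features.
import torch.nn.functional as F


def dfcil_loss(logits, labels, new_feats, old_feats, importance, new_class_mask,
               distill_weight=1.0):
    # new_class_mask: bool tensor over classes; mask out old-class logits so the
    # cross-entropy only competes among the newly introduced classes.
    masked_logits = logits.masked_fill(~new_class_mask, float("-inf"))
    ce = F.cross_entropy(masked_logits, labels)
    # Per-dimension importance weights on the feature-matching term.
    distill = (importance * (new_feats - old_feats.detach()) ** 2).mean()
    return ce + distill_weight * distill
```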
arXiv Detail & Related papers (2021-06-17T17:56:08Z) - Transferable, Controllable, and Inconspicuous Adversarial Attacks on Person Re-identification With Deep Mis-Ranking [83.48804199140758]
We propose a learning-to-mis-rank formulation to perturb the ranking of the system output.
We also perform a black-box attack by developing a novel multi-stage network architecture.
Our method can control the number of malicious pixels by using differentiable multi-shot sampling.
arXiv Detail & Related papers (2020-04-08T18:48:29Z)
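The mis-ranking idea above can be illustrated with an inverted triplet objective (an assumed form, not the paper's exact loss): the attack is optimized so that true matches end up farther from the query than non-matches, corrupting the retrieval ranking.

```python
# Assumed mis-ranking objective (not the paper's exact loss): an inverted triplet
# margin that pushes the true match farther from the query than a non-match,
# so the gallery ranking for the perturbed image is corrupted.
import torch.nn.functional as F


def mis_ranking_loss(query_feat, pos_feat, neg_feat, margin=0.5):
    d_pos = F.pairwise_distance(query_feat, pos_feat)  # distance to the true match
    d_neg = F.pairwise_distance(query_feat, neg_feat)  # distance to a non-match
    # Minimizing this drives d_pos above d_neg by at least the margin,
    # the reverse of the usual ReID triplet loss.
    return F.relu(d_neg - d_pos + margin).mean()
```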