Related papers: A Comprehensive Study of Bugs in Modern Distributed Deep Learning Systems

A Comprehensive Study of Bugs in Modern Distributed Deep Learning Systems

URL: http://arxiv.org/abs/2512.20345v1
Date: Tue, 23 Dec 2025 13:27:05 GMT
Title: A Comprehensive Study of Bugs in Modern Distributed Deep Learning Systems
Authors: Xiaoxue Ma, Wanwei Zhan, Jiale Chen, Yishu Li, Jacky Keung, Federica Sarro,
Abstract summary: This study conducts the first large-scale empirical analysis of practitioner challenges in dedicated distributed frameworks.<n>We examine 849 real-world issues from DeepSpeed, Megatron-LM, and Colossal-AI and construct a taxonomy of 34 bug symptoms, 28 root causes, and 6 fix patterns.<n>Our results show that 45.1% of bug symptoms are unique to distributed frameworks, with setup failures, memory issues, and performance anomalies being the most prevalent.
Score: 7.767904938990508
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In today's data-driven era, deep learning is vital for processing massive datasets, yet single-device training is constrained by computational and memory limits. Distributed deep learning overcomes these challenges by leveraging multiple GPUs or machines in parallel. While general-purpose frameworks (e.g., TensorFlow and PyTorch) provide distributed capabilities, these are often add-on features that demand significant manual effort for advanced parallelism, underscoring the need for specialized frameworks. This study conducts the first large-scale empirical analysis of practitioner challenges in dedicated distributed frameworks. We examine 849 real-world issues from DeepSpeed, Megatron-LM, and Colossal-AI and construct a taxonomy of 34 bug symptoms, 28 root causes, and 6 fix patterns. Crucially, we establish explicit mappings between symptoms, causes, and fixes across distributed training stages, enabling a systematic understanding of how issues emerge and are resolved. Our results show that 45.1\% of bug symptoms are unique to distributed frameworks, with setup failures, memory issues, and performance anomalies being the most prevalent. Moreover, 95\% of issues in the communication setup stage occur exclusively in distributed contexts. We also find over 60\% of cases can be resolved through version and dependency management, and distributed feature, API, and communication tuning. Based on these findings, we provide actionable implications.

Related papers

Context-Specific Causal Graph Discovery with Unobserved Contexts: Non-Stationarity, Regimes and Spatio-Temporal Patterns [8.121462458089143]
We study the information encoded in changes of the causal graph, with stability in mind.<n>A built-in modularity allows to systematically understand and improve upon an entire array of subproblems.
arXiv Detail & Related papers (2025-11-26T16:06:36Z)
Cross-Learning from Scarce Data via Multi-Task Constrained Optimization [70.90607489166648]
This paper introduces a multi-task emphcross-learning framework to overcome data scarcity.<n>We formulate this joint estimation as a constrained optimization problem.<n>We show the efficiency of our cross-learning method in applications with real data including image classification and propagation of infectious diseases.
arXiv Detail & Related papers (2025-11-17T18:35:59Z)
Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models [7.486731499255164]
This paper conducts the first large-scale empirical analysis of 308 fixed bugs across three popular distributed training/inference frameworks: DeepSpeed, Megatron-LM, and Colossal-AI.<n>We examine bug symptoms, root causes, bug identification and fixing efforts, and common low-effort fixing strategies.
arXiv Detail & Related papers (2025-06-12T07:24:59Z)
A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [89.92916473403108]
This paper proposes a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods.<n>The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics.<n>We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
A Survey of Deep Long-Tail Classification Advancements [1.6233132273470656]
Many data distributions in the real world are hardly uniform. Instead, skewed and long-tailed distributions of various kinds are commonly observed. This poses an interesting problem for machine learning, where most algorithms assume or work well with uniformly distributed data. The problem is further exacerbated by current state-of-the-art deep learning models requiring large volumes of training data.
arXiv Detail & Related papers (2024-04-24T01:59:02Z)
Multi-modal Causal Structure Learning and Root Cause Analysis [67.67578590390907]
We propose Mulan, a unified multi-modal causal structure learning method for root cause localization. We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data. We also introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph.
arXiv Detail & Related papers (2024-02-04T05:50:38Z)
Leveraging Frequency Domain Learning in 3D Vessel Segmentation [50.54833091336862]
In this study, we leverage Fourier domain learning as a substitute for multi-scale convolutional kernels in 3D hierarchical segmentation models. We show that our novel network achieves remarkable dice performance (84.37% on ASACA500 and 80.32% on ImageCAS) in tubular vessel segmentation tasks.
arXiv Detail & Related papers (2024-01-11T19:07:58Z)
The PetShop Dataset -- Finding Causes of Performance Issues across Microservices [3.87228935312714]
This paper introduces a dataset specifically designed for evaluating root cause analyses in microservice-based applications. The dataset encompasses latency, requests, and availability metrics emitted in 5-minute intervals from a distributed application. In addition to normal operation metrics, the dataset includes 68 injected performance issues, which increase latency and reduce availability throughout the system.
arXiv Detail & Related papers (2023-11-08T16:30:12Z)
XFlow: Benchmarking Flow Behaviors over Graphs [8.324217198711208]
The study of networks encompasses a diverse range of academic disciplines, including mathematics, physics, social science, and computer science. One of the primary obstacles to current research in this field is the absence of a curated benchmark suite to study the flow behaviors under network scenarios. We propose the implementation of a novel benchmark suite that encompasses a variety of tasks, baseline models, graph datasets, and evaluation tools.
arXiv Detail & Related papers (2023-08-07T14:49:22Z)
On-Device Domain Generalization [93.79736882489982]
Domain generalization is critical to on-device machine learning applications. We find that knowledge distillation is a strong candidate for solving the problem. We propose a simple idea called out-of-distribution knowledge distillation (OKD), which aims to teach the student how the teacher handles (synthetic) out-of-distribution data.
arXiv Detail & Related papers (2022-09-15T17:59:31Z)
Causal Scene BERT: Improving object detection by searching for challenging groups of data [125.40669814080047]
Computer vision applications rely on learning-based perception modules parameterized with neural networks for tasks like object detection. These modules frequently have low expected error overall but high error on atypical groups of data due to biases inherent in the training process. Our main contribution is a pseudo-automatic method to discover such groups in foresight by performing causal interventions on simulated scenes.
arXiv Detail & Related papers (2022-02-08T05:14:16Z)
Leveraging Ensembles and Self-Supervised Learning for Fully-Unsupervised Person Re-Identification and Text Authorship Attribution [77.85461690214551]
Learning from fully-unlabeled data is challenging in Multimedia Forensics problems, such as Person Re-Identification and Text Authorship Attribution. Recent self-supervised learning methods have shown to be effective when dealing with fully-unlabeled data in cases where the underlying classes have significant semantic differences. We propose a strategy to tackle Person Re-Identification and Text Authorship Attribution by enabling learning from unlabeled data even when samples from different classes are not prominently diverse.
arXiv Detail & Related papers (2022-02-07T13:08:11Z)
A neural anisotropic view of underspecification in deep learning [60.119023683371736]
We show that the way neural networks handle the underspecification of problems is highly dependent on the data representation. Our results highlight that understanding the architectural inductive bias in deep learning is fundamental to address the fairness, robustness, and generalization of these systems.
arXiv Detail & Related papers (2021-04-29T14:31:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.