Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models
- URL: http://arxiv.org/abs/2506.10426v1
- Date: Thu, 12 Jun 2025 07:24:59 GMT
- Title: Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models
- Authors: Xiao Yu, Haoxuan Chen, Feifei Niu, Xing Hu, Jacky Wai Keung, Xin Xia
- Abstract summary: This paper conducts the first large-scale empirical analysis of 308 fixed bugs across three popular distributed training/inference frameworks: DeepSpeed, Megatron-LM, and Colossal-AI. We examine bug symptoms, root causes, bug identification and fixing efforts, and common low-effort fixing strategies.
- Score: 7.486731499255164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid development of large language models (LLMs), distributed training and inference frameworks like DeepSpeed have become essential for scaling model training and inference across multiple GPUs or nodes. However, the increasing complexity of these frameworks brings non-trivial software bugs, which may degrade training performance, cause unexpected failures, and result in significant resource waste. Understanding the characteristics of framework bugs is fundamental to quality assurance, allowing the design of more effective debugging and repair methods. Thus, our paper conducts the first large-scale empirical analysis of 308 fixed bugs across three popular distributed training/inference frameworks: DeepSpeed, Megatron-LM, and Colossal-AI. We examine bug symptoms, root causes, bug identification and fixing efforts, and common low-effort fixing strategies. Additionally, the distributed nature of these frameworks introduces unique bug root causes, such as allocation strategy errors and distributed communication errors. Diagnosing and fixing complex bugs remains challenging due to factors such as the disconnect between symptoms and root causes, high bug reproduction costs, and low-level or cross-component interactions. Interestingly, we observe that 48% of bug fixes require minimal code changes (<=10 LOC) and follow simple strategies such as conditional logic optimization, parameter handling enhancement, or version compatibility handling, indicating potential for automation. Based on these insights, we offer several implications for improving the reliability of both distributed training and inference frameworks and the LLM projects that depend on them, while also identifying opportunities to leverage LLM-based tools for automated debugging and repair.
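To make the low-effort strategies above concrete, here is a minimal, hypothetical sketch of what two of them, version compatibility handling and conditional logic optimization, might look like in Python. The helper names, the version threshold, and the empty-bucket scenario are illustrative assumptions, not code taken from DeepSpeed, Megatron-LM, or Colossal-AI.

```python
# Hypothetical sketch of two low-effort fix patterns from the paper's
# taxonomy; all names and thresholds are illustrative, not framework code.

def parse_version(v: str) -> tuple:
    """Turn a version string like '2.1.0+cu118' into (2, 1, 0)."""
    core = v.split("+")[0]
    return tuple(int(p) for p in core.split(".") if p.isdigit())

# Pattern 1: version compatibility handling. Gate a newer API behind a
# version check instead of calling it unconditionally and failing on
# older installs of a dependency.
def reduce_gradients(tensors, torch_version: str) -> str:
    if parse_version(torch_version) >= (2, 0):
        return f"fused all-reduce over {len(tensors)} tensors"   # newer fast path
    return f"per-tensor all-reduce over {len(tensors)} tensors"  # legacy fallback

# Pattern 2: conditional logic optimization. A typical <=10 LOC fix that
# tightens a guard so an edge case (an empty gradient bucket) no longer
# raises downstream.
def flatten_bucket(bucket: list) -> list:
    if not bucket:  # fix: return early on the empty-bucket edge case
        return []
    return [g for g in bucket if g is not None]

if __name__ == "__main__":
    print(reduce_gradients([0.1, 0.2], "2.1.0+cu118"))  # takes the fused path
    print(reduce_gradients([0.1, 0.2], "1.13.1"))       # takes the fallback
    print(flatten_bucket([]))                           # [] instead of a crash
```

Both fixes stay under ten changed lines, consistent with the paper's observation that such patterns are promising targets for automated repair.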
Related papers
- Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering [75.12322966980003]
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains. Most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering. We propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA.
arXiv Detail & Related papers (2025-06-11T12:03:52Z)
- LLM-Based Detection of Tangled Code Changes for Higher-Quality Method-Level Bug Datasets [5.191767648600372]
This study investigates the utility of Large Language Models for detecting tangled code changes by leveraging both commit messages and method-level code diffs. Our results demonstrate that combining commit messages with code diffs significantly enhances model performance, with the combined few-shot and chain-of-thought prompting achieving an F1-score of 0.88. These findings are encouraging for the research community, as method-level bug prediction remains an open research problem, largely due to the lack of noise-free bug datasets.
arXiv Detail & Related papers (2025-05-13T06:26:13Z)
- Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
- Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [64.83955753606443]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities. Current error classification methods rely on static and predefined categories. We introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples.
arXiv Detail & Related papers (2025-01-26T16:17:57Z)
- Towards Fault Tolerance in Multi-Agent Reinforcement Learning [10.314885753121905]
Agent faults pose a significant threat to the performance of multi-agent reinforcement learning (MARL) algorithms. This paper enhances the fault tolerance of MARL by combining an optimized model architecture with a tailored training data sampling strategy.
arXiv Detail & Related papers (2024-11-30T16:56:29Z)
- An Empirical Study on LLM-based Agents for Automated Bug Fixing [2.433168823911037]
Large language models (LLMs) and LLM-based agents have been applied to fix bugs automatically.
We examine seven proprietary and open-source systems on the SWE-bench Lite benchmark for automated bug fixing.
arXiv Detail & Related papers (2024-11-15T14:19:15Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement [93.38736019287224]
"LLMs-as-Instructors" framework autonomously enhances the training of smaller target models.
Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model.
Within this framework, we implement two strategies: "Learning from Error," which focuses solely on incorrect responses to tailor training data, and "Learning from Error by Contrast", which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors.
arXiv Detail & Related papers (2024-06-29T17:16:04Z)
- AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, yet they still exhibit significant but subtle weaknesses. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies. We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z)
- A Deep Dive into Large Language Models for Automated Bug Localization and Repair [12.756202755547024]
Large language models (LLMs) have shown impressive effectiveness in various software engineering tasks, including automated program repair (APR).
In this study, we take a deep dive into automated bug fixing utilizing LLMs.
Separating bug localization from bug fixing, with each handled by a different LLM, enables effective integration of diverse contextual information.
The proposed approach, Toggle, achieves new state-of-the-art (SOTA) performance on the CodeXGLUE code refinement benchmark.
arXiv Detail & Related papers (2024-04-17T17:48:18Z)