Related papers: A First Look at Bugs in LLM Inference Engines

URL: http://arxiv.org/abs/2506.09713v1
Date: Wed, 11 Jun 2025 13:25:36 GMT
Title: A First Look at Bugs in LLM Inference Engines
Authors: Mugeng Liu, Siqi Zhong, Weichen Bi, Yixuan Zhang, Zhiyang Chen, Zhenpeng Chen, Xuanzhe Liu, Yun Ma
Abstract summary: We present the first empirical study on bugs in LLM inference engines. We mine official repositories of 5 widely adopted LLM inference engines, constructing a comprehensive dataset of 929 real-world bugs. Our findings reveal six major bug symptoms and a taxonomy of 28 root causes, shedding light on the key challenges in bug detection and location.
Score: 20.52104110967081
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model-specific inference engines (LLM inference engines for short) have become a fundamental component of modern AI infrastructure, enabling the deployment of LLM-powered applications (LLM apps) across cloud and local devices. Despite their critical role, LLM inference engines are prone to bugs due to the immense resource demands of LLMs and the complexities of cross-platform compatibility. However, a systematic understanding of these bugs remains lacking. To bridge this gap, we present the first empirical study on bugs in LLM inference engines. We mine official repositories of 5 widely adopted LLM inference engines, constructing a comprehensive dataset of 929 real-world bugs. Through a rigorous open coding process, we analyze these bugs to uncover their symptoms, root causes, and commonality. Our findings reveal six major bug symptoms and a taxonomy of 28 root causes, shedding light on the key challenges in bug detection and location within LLM inference engines. Based on these insights, we propose a series of actionable implications for researchers, inference engine vendors, and LLM app developers.

Related papers

When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling [3.0882445204950013]
We present the first comprehensive study of bug types, root causes, and effects in LLM agent-based software. We collected and analyzed 1,187 bug-related posts and code snippets from Stack Overflow, GitHub, and Hugging Face forums. We built a ReAct agent named BugReAct, equipped with adequate external tools to determine whether it can detect and annotate the bugs in our dataset.
arXiv Detail & Related papers (2026-01-21T18:13:10Z)
Understanding LLM-Centric Challenges for Deep Learning Frameworks: An Empirical Analysis [32.10519814607409]
Large language models (LLMs) have driven significant progress across a wide range of real-world applications. Deep learning (DL) frameworks provide this foundation by enabling efficient model construction, distributed execution, and optimized deployment. Poor usability, limited functionality, and subtle bugs in DL frameworks may hinder development efficiency and cause severe failures or resource waste.
arXiv Detail & Related papers (2025-06-16T05:45:03Z)
Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models [7.486731499255164]
This paper conducts the first large-scale empirical analysis of 308 fixed bugs across three popular distributed training/inference frameworks: DeepSpeed, Megatron-LM, and Colossal-AI. We examine bug symptoms, root causes, bug identification and fixing efforts, and common low-effort fixing strategies.
arXiv Detail & Related papers (2025-06-12T07:24:59Z)
Empirical Evaluation of Generalizable Automated Program Repair with Large Language Models [4.757323827658957]
Automated Program Repair proposes bug fixes to aid developers in maintaining software. Recent works have shown that LLMs can be used to generate repairs. We evaluate a diverse set of 13 recent models, including open ones (e.g., Llama 3.3, Qwen 2.5 Coder, and DeepSeek R1 (dist.)) and closed ones (e.g., o3-mini, GPT-4o, Claude 3.7 Sonnet, Gemini 2.0 Flash).
arXiv Detail & Related papers (2025-06-03T18:15:14Z)
ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
TOOLSCAN is a new benchmark to identify error patterns in LLM output on tool-use tasks. We show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use these insights from TOOLSCAN to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
Language Agents Meet Causality -- Bridging LLMs and Causal World Models [50.79984529172807]
We propose a framework that integrates causal representation learning with large language models. This framework learns a causal world model, with causal variables linked to natural language expressions. We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities.
arXiv Detail & Related papers (2024-10-25T18:36:37Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for Large Language Models (LLMs). It covers four major bug categories and 18 minor types in C++, Java, and Python. We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
Exploring Large Language Models in Resolving Environment-Related Crash Bugs: Localizing and Repairing [36.4673637256627]
Large language models (LLMs) have shown promise in software engineering tasks. We conduct the first comprehensive study to assess the capability of LLMs in resolving real-world environment crash bugs. Our findings reveal that localization is the primary challenge for resolving code crashes, while repair poses a greater challenge for environment crashes.
arXiv Detail & Related papers (2023-12-16T13:41:04Z)
Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction [12.851941377433285]
Large language models (LLMs) have been demonstrated to be adept at natural language processing and code generation. Our proposed technique LIBRO could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark.
arXiv Detail & Related papers (2023-11-08T08:42:30Z)
The GitHub Recent Bugs Dataset for Evaluating LLM-based Debugging Applications [20.339673903885483]
Large Language Models (LLMs) have demonstrated strong natural language processing and code synthesis capabilities. Details about LLM training data are often not made public, which has caused concern as to whether existing bug benchmarks are included. We present the GitHub Recent Bugs dataset, which includes 76 real-world Java bugs that were gathered after the OpenAI data cut-off point.
arXiv Detail & Related papers (2023-10-20T02:37:44Z)
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs). As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)
A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization [12.80414941523501]
AutoFL generates an explanation of the bug along with a suggested fault location. Experiments on 798 real-world bugs in Java and Python reveal AutoFL improves method-level acc@1 by up to 233.3% over baselines.
arXiv Detail & Related papers (2023-08-10T10:26:55Z)
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks. This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.