Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs
- URL: http://arxiv.org/abs/2601.13655v1
- Date: Tue, 20 Jan 2026 06:42:56 GMT
- Title: Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs
- Authors: Guangba Yu, Zirui Wang, Yujie Huang, Renyi Zhong, Yuedong Zhong, Yilun Wang, Michael R. Lyu
- Abstract summary: We conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems. Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from model algorithmic defects to the systemic fragility of the deployment stack.
- Score: 50.075587392477935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The democratization of open-source Large Language Models (LLMs) allows users to fine-tune and deploy models on local infrastructure but exposes them to a First Mile deployment landscape. Unlike black-box API consumption, the reliability of user-managed orchestration remains a critical blind spot. To bridge this gap, we conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems. Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from model algorithmic defects to the systemic fragility of the deployment stack. We identify three key phenomena: (1) Diagnostic Divergence: runtime crashes distinctively signal infrastructure friction, whereas incorrect functionality serves as a signature for internal tokenizer defects. (2) Systemic Homogeneity: Root causes converge across divergent series, confirming reliability barriers are inherent to the shared ecosystem rather than specific architectures. (3) Lifecycle Escalation: Barriers escalate from intrinsic configuration struggles during fine-tuning to compounded environmental incompatibilities during inference. Supported by our publicly available dataset, these insights provide actionable guidance for enhancing the reliability of the LLM landscape.
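As an illustration of the tokenizer-side failures the study highlights, the following is a minimal pre-deployment sanity check; it assumes the Hugging Face transformers library, and the model path is a hypothetical placeholder rather than an artifact from the paper.

```python
# A minimal sketch of a first-mile sanity check motivated by the finding that
# incorrect functionality often traces back to tokenizer defects.
# Assumes the Hugging Face `transformers` library is installed.
from transformers import AutoConfig, AutoTokenizer

def check_tokenizer_consistency(model_path: str) -> None:
    config = AutoConfig.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # A tokenizer vocabulary larger than the model's embedding table is a
    # classic source of silent garbage output or index-out-of-range crashes.
    if len(tokenizer) > config.vocab_size:
        raise ValueError(
            f"Tokenizer has {len(tokenizer)} tokens but the model embeds "
            f"only {config.vocab_size}; resize embeddings or fix the tokenizer."
        )

    # Missing special tokens frequently break chat templates after fine-tuning.
    for name in ("eos_token", "pad_token"):
        if getattr(tokenizer, name, None) is None:
            print(f"warning: {name} is unset; generation may never terminate.")

# check_tokenizer_consistency("./my-finetuned-llama")  # hypothetical local path
```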
Related papers
- A Secure and Private Distributed Bayesian Federated Learning Design [56.92336577799572]
Distributed Federated Learning (DFL) enables decentralized model training across large-scale systems without a central parameter server. DFL faces three critical challenges: privacy leakage from honest-but-curious neighbors, slow convergence due to the lack of central coordination, and vulnerability to Byzantine adversaries aiming to degrade model accuracy. We propose a novel DFL framework that integrates Byzantine robustness, privacy preservation, and convergence acceleration.
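For illustration, here is a minimal sketch of one standard Byzantine-robust aggregator (coordinate-wise median); this is a textbook example, not necessarily the aggregation rule this paper proposes.

```python
# Coordinate-wise median aggregation: a standard Byzantine-robust rule.
# Shown for illustration only; assumes updates are flattened NumPy arrays.
import numpy as np

def coordinatewise_median(client_updates: list[np.ndarray]) -> np.ndarray:
    """The median is insensitive to a minority of arbitrarily corrupted
    (Byzantine) contributions, unlike the plain mean."""
    stacked = np.stack(client_updates)  # shape: (n_clients, n_params)
    return np.median(stacked, axis=0)

honest = [np.full(4, 1.0), np.full(4, 1.1), np.full(4, 0.9)]
byzantine = [np.full(4, 1e6)]                     # adversarial update
print(coordinatewise_median(honest + byzantine))  # stays near 1.0
```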
arXiv Detail & Related papers (2026-02-23T16:12:02Z)
- Adaptive Dual-Weighting Framework for Federated Learning via Out-of-Distribution Detection [53.45696787935487]
Federated Learning (FL) enables collaborative model training across large-scale distributed service nodes. In real-world service-oriented deployments, data generated by heterogeneous users, devices, and application scenarios are inherently non-IID. We propose FLood, a novel FL framework inspired by out-of-distribution (OOD) detection.
arXiv Detail & Related papers (2026-02-01T05:54:59Z)
- Tri-LLM Cooperative Federated Zero-Shot Intrusion Detection with Semantic Disagreement and Trust-Aware Aggregation [5.905949608791961]
This paper introduces a semantics-driven federated IDS framework that incorporates language-derived semantic supervision into federated optimization. The framework achieves over 80% zero-shot detection accuracy on unseen attack patterns, improving zero-day discrimination by more than 10% compared to similarity-based baselines.
arXiv Detail & Related papers (2026-01-30T16:38:05Z)
- The Semantic Trap: Do Fine-tuned LLMs Learn Vulnerability Root Cause or Just Functional Pattern? [14.472036099680961]
We propose TrapEval, a comprehensive evaluation framework designed to disentangle vulnerability root cause from functional pattern. We fine-tune five representative state-of-the-art LLMs across three model families and evaluate them under cross-dataset testing, semantic-preserving transformations, and varying degrees of semantic gap measured by CodeBLEU. Our findings serve as a wake-up call: high benchmark scores on traditional datasets may be illusory, masking the model's inability to understand the true causal logic of vulnerabilities.
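For context, here is a minimal sketch of one kind of semantic-preserving transformation such an evaluation can apply: consistently renaming locally assigned variables changes surface patterns while leaving behavior (and any real vulnerability) intact. This is an illustrative example, not TrapEval's actual transformation suite.

```python
# Alpha-rename variables assigned in a snippet; free names (read, len, copy)
# keep their meaning, so straight-line semantics are preserved.
import ast

class RenameLocals(ast.NodeTransformer):
    def __init__(self):
        self.mapping: dict[str, str] = {}

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if isinstance(node.ctx, ast.Store):
            new = self.mapping.setdefault(node.id, f"v{len(self.mapping)}")
        else:
            new = self.mapping.get(node.id, node.id)  # leave free names intact
        return ast.copy_location(ast.Name(id=new, ctx=node.ctx), node)

src = "buf = read()\nsize = len(buf)\ncopy(buf, size)"
tree = RenameLocals().visit(ast.parse(src))
print(ast.unparse(tree))
# buf/size become v0/v1; read, len, copy are untouched:
#   v0 = read()
#   v1 = len(v0)
#   copy(v0, v1)
```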
arXiv Detail & Related papers (2026-01-30T07:19:17Z)
- CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs [53.199517625701475]
CoG is a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
arXiv Detail & Related papers (2026-01-16T07:27:40Z)
- Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure [2.0880077827773227]
We introduce Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining.
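As a toy illustration of representation-level suppression (not KIF's actual mechanism), the sketch below removes the component of a layer's activations along a hypothetical "signature" direction via a PyTorch forward hook.

```python
# Project out a known signature direction from a layer's output.
# The layer and direction here are hypothetical placeholders.
import torch
from torch import nn

def make_suppression_hook(direction: torch.Tensor):
    d = direction / direction.norm()
    def hook(module, inputs, output):
        # Remove the activation component along the signature direction.
        return output - (output @ d).unsqueeze(-1) * d
    return hook

layer = nn.Linear(8, 8)
signature = torch.randn(8)  # hypothetical subject-specific signature
layer.register_forward_hook(make_suppression_hook(signature))

out = layer(torch.randn(2, 8))
d = signature / signature.norm()
print((out @ d).abs().max())  # ~0: the signature component is erased
```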
arXiv Detail & Related papers (2026-01-15T16:28:14Z)
- Hypothesize-Then-Verify: Speculative Root Cause Analysis for Microservices with Pathwise Parallelism [19.31110304702373]
SpecRCA is a speculative root cause analysis framework that adopts a hypothesize-then-verify paradigm. Preliminary experiments on the AIOps 2022 dataset demonstrate that SpecRCA achieves superior accuracy and efficiency compared to existing approaches.
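Here is a minimal sketch of the hypothesize-then-verify pattern with pathwise parallelism; the proposer and verifier below are hypothetical stand-ins for SpecRCA's actual components.

```python
# Propose candidate root causes cheaply, verify them concurrently,
# and return the best-supported hypothesis.
from concurrent.futures import ThreadPoolExecutor

def propose_hypotheses(symptom: str) -> list[str]:
    # Hypothetical stand-in for a speculative proposer.
    return [f"{symptom}: cause {c}" for c in ("cpu", "network", "deploy")]

def verify(hypothesis: str) -> float:
    # Hypothetical stand-in for an expensive check against traces/metrics.
    return 0.9 if "network" in hypothesis else 0.2

def root_cause(symptom: str) -> str:
    hypotheses = propose_hypotheses(symptom)
    with ThreadPoolExecutor() as pool:  # verify each path in parallel
        scores = list(pool.map(verify, hypotheses))
    return max(zip(scores, hypotheses))[1]

print(root_cause("latency spike in checkout service"))
```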
arXiv Detail & Related papers (2026-01-06T05:58:25Z)
- Mechanistic Analysis of Circuit Preservation in Federated Learning [0.3823356975862005]
Federated Learning (FL) enables collaborative training of models on decentralized data, but its performance degrades significantly under Non-IID data conditions. This paper investigates the canonical FedAvg algorithm through the lens of Mechanistic Interpretability (MI) to diagnose this failure mode.
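For reference, a minimal NumPy sketch of the canonical FedAvg aggregation step the paper analyzes; client training loops are omitted.

```python
# FedAvg: the server replaces the global model with the example-weighted
# average of client models.
import numpy as np

def fedavg(client_weights: list[np.ndarray], n_examples: list[int]) -> np.ndarray:
    total = sum(n_examples)
    return sum(w * (n / total) for w, n in zip(client_weights, n_examples))

# Two clients with very different data volumes: the larger client dominates,
# which is exactly the regime where non-IID data can overwrite circuits
# learned elsewhere.
clients = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(fedavg(clients, n_examples=[90, 10]))  # -> [0.9, 0.1]
```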
arXiv Detail & Related papers (2025-12-28T19:03:14Z)
- EReLiFM: Evidential Reliability-Aware Residual Flow Meta-Learning for Open-Set Domain Generalization under Noisy Labels [85.78886153628663]
Open-Set Domain Generalization aims to enable deep learning models to recognize unseen categories in new domains. Label noise hinders open-set domain generalization by corrupting source-domain knowledge. We propose Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM) to bridge domain gaps.
arXiv Detail & Related papers (2025-10-14T16:23:11Z)
- DMFI: Dual-Modality Fine-Tuning and Inference Framework for LLM-Based Insider Threat Detection [9.049925971684837]
Insider threat detection (ITD) poses a persistent and high-impact challenge in cybersecurity. Traditional models often struggle to capture semantic intent and complex behavior dynamics. We propose DMFI, a dual-modality framework that integrates semantic inference with behavior-aware fine-tuning.
arXiv Detail & Related papers (2025-08-06T18:44:40Z)
- Learning Unified System Representations for Microservice Tail Latency Prediction [8.532290784939967]
Microservice architectures have become the de facto standard for building scalable cloud-native applications. Traditional approaches often rely on per-request latency metrics, which are highly sensitive to transient noise. We propose USRFNet, a deep learning network that explicitly separates and models traffic-side and resource-side features.
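As a rough illustration of the two-branch design the summary describes (layer sizes and the fusion scheme are assumptions, not USRFNet's actual architecture), a PyTorch sketch:

```python
# Encode traffic-side and resource-side features in separate branches,
# then fuse them for a tail-latency prediction.
import torch
from torch import nn

class TwoBranchLatencyModel(nn.Module):
    def __init__(self, traffic_dim: int, resource_dim: int, hidden: int = 32):
        super().__init__()
        self.traffic_enc = nn.Sequential(nn.Linear(traffic_dim, hidden), nn.ReLU())
        self.resource_enc = nn.Sequential(nn.Linear(resource_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, 1)  # predicts e.g. P99 latency

    def forward(self, traffic: torch.Tensor, resource: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.traffic_enc(traffic), self.resource_enc(resource)], dim=-1)
        return self.head(fused)

model = TwoBranchLatencyModel(traffic_dim=6, resource_dim=4)
print(model(torch.randn(8, 6), torch.randn(8, 4)).shape)  # torch.Size([8, 1])
```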
arXiv Detail & Related papers (2025-08-03T07:46:23Z)
- Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples. It achieves near-zero attack success rates while maintaining clean-task performance.
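A minimal sketch of an attention-entropy signal of the kind BYE leverages; the flagging threshold here is an illustrative assumption, not the paper's rule.

```python
# Shannon entropy of attention distributions, averaged per sample; the
# intuition is that backdoor triggers concentrate attention abnormally.
import torch

def attention_entropy(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """attn: (batch, heads, queries, keys), rows summing to 1."""
    return -(attn * (attn + eps).log()).sum(-1).mean(dim=(1, 2))

attn = torch.softmax(torch.randn(4, 2, 5, 5), dim=-1)
ent = attention_entropy(attn)
suspicious = ent < ent.mean() - 2 * ent.std()  # low entropy = concentrated
print(ent, suspicious)
```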
arXiv Detail & Related papers (2025-05-22T17:11:58Z)
- Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
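A minimal sketch of a critique-driven refinement loop in the spirit of AlignRAG, with an auto-style stop when the critic reports no misalignment; `generate` and `critique` are hypothetical stand-ins for LLM calls.

```python
# Generate, critique against the retrieved evidence, and refine until the
# critic is satisfied or an iteration budget is exhausted.
def generate(question: str, evidence: list[str], feedback: str = "") -> str:
    # Hypothetical stand-in for an LLM call conditioned on the question,
    # retrieved evidence, and (optionally) critic feedback.
    return evidence[0] if feedback else "an unsupported first draft"

def critique(answer: str, evidence: list[str]) -> str:
    # Hypothetical stand-in critic: flag answers not grounded in the evidence.
    return "" if answer in evidence else "claim is not supported by the evidence"

def aligned_answer(question: str, evidence: list[str], max_iters: int = 3) -> str:
    answer = generate(question, evidence)
    for _ in range(max_iters):
        feedback = critique(answer, evidence)
        if not feedback:  # auto-termination: the critic found no misalignment
            break
        answer = generate(question, evidence, feedback)
    return answer

print(aligned_answer("q?", ["the evidence-backed answer"]))
```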
arXiv Detail & Related papers (2025-04-21T04:56:47Z)