Analysis of LLM Vulnerability to GPU Soft Errors: An Instruction-Level Fault Injection Study
- URL: http://arxiv.org/abs/2601.19912v1
- Date: Thu, 25 Dec 2025 11:59:54 GMT
- Title: Analysis of LLM Vulnerability to GPU Soft Errors: An Instruction-Level Fault Injection Study
- Authors: Duo Chai, Zizhen Liu, Shuhuai Wang, Songwei Pei, Cheng Liu, Huawei Li, Shangguang Wang
- Abstract summary: Large language models (LLMs) are highly compute- and memory-intensive. LLMs' resilience to soft errors may differ substantially from earlier models. We conduct the first instruction-level fault injection study of LLM inference.
- Score: 11.583997354005795
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are highly compute- and memory-intensive, posing significant demands on high-performance GPUs. At the same time, advances in GPU technology driven by shrinking transistor sizes and lower operating voltages have made these devices increasingly susceptible to soft errors. While prior work has examined GPU reliability, most studies have focused on general-purpose applications or conventional neural networks used mainly for vision tasks such as classification and detection. In contrast, systematic analysis of modern large-scale LLMs remains limited, despite their rapid adoption in diverse application scenarios. Given the unique characteristics of LLMs, their resilience to soft errors may differ substantially from that of earlier models. To bridge this gap, we conduct the first instruction-level fault injection study of LLM inference. Our approach reveals reliability characteristics from multiple perspectives, highlighting the effects of model architecture, parameter scale, and task complexity. These findings provide new insights into LLM reliability and inform the design of more effective fault tolerance mechanisms.
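The abstract does not spell out the injection mechanics, and the study itself operates at the GPU instruction level. As a rough, hypothetical illustration of the underlying single-bit-flip fault model, the NumPy sketch below (an approximation at the tensor level, not the authors' tooling) corrupts one bit of a float32 weight matrix and measures how far a matrix-vector product drifts from the fault-free "golden" output; whether the flipped bit lands in the sign, exponent, or mantissa largely determines the damage.

```python
import numpy as np

def flip_random_bit(weights: np.ndarray, rng: np.random.Generator) -> tuple[int, int]:
    """Flip one random bit of a float32 array in place (single-bit soft-error model).

    Returns the flat element index and bit position that were corrupted.
    """
    flat = weights.reshape(-1).view(np.uint32)  # reinterpret the raw bits, no copy
    idx = int(rng.integers(flat.size))          # which element takes the hit
    bit = int(rng.integers(32))                 # which of its 32 bits flips
    flat[idx] ^= np.uint32(1) << np.uint32(bit)
    return idx, bit

# Toy campaign: one fault per trial, compared against the fault-free "golden" output.
rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal(64).astype(np.float32)
golden = weights @ x
for trial in range(5):
    corrupted = weights.copy()
    idx, bit = flip_random_bit(corrupted, rng)
    deviation = float(np.max(np.abs(corrupted @ x - golden)))
    print(f"trial {trial}: element {idx}, bit {bit}, max deviation {deviation:.3e}")
```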
Related papers
- A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks. Our study reveals that visual redundancy emerges only in from-scratch dMLLMs when handling long-answer tasks. Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z)
- Exploring LLM-based Frameworks for Fault Diagnosis [2.2562573557834686]
Large Language Model (LLM)-based systems present new opportunities for autonomous health monitoring in sensor-rich industrial environments. This study explores the potential of LLMs to detect and classify faults directly from sensor data, while producing inherently explainable outputs through natural language reasoning.
arXiv Detail & Related papers (2025-09-27T04:53:15Z)
- Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [78.09559830840595]
We present the first systematic study on quantizing diffusion-based language models. We identify the presence of activation outliers, characterized by abnormally large activation values. We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation.
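The abstract names activation outliers as the key obstacle but not how they are found. Below is a minimal sketch of one common way to locate them, flagging channels whose peak magnitude dwarfs the median channel; the threshold k and the synthetic data are illustrative assumptions, not the paper's method.

```python
import numpy as np

def find_outlier_channels(acts: np.ndarray, k: float = 6.0) -> np.ndarray:
    """Flag channels whose peak activation magnitude dwarfs the typical channel.

    acts: (num_tokens, num_channels) activations captured from one layer.
    k:    illustrative threshold multiplier, not a value from the paper.
    """
    peak = np.abs(acts).max(axis=0)        # per-channel peak magnitude
    typical = np.median(peak)              # robust scale across channels
    return np.where(peak > k * typical)[0] # indices of outlier channels

# Synthetic demo: most channels ~N(0,1), two channels with huge activations.
rng = np.random.default_rng(0)
acts = rng.standard_normal((512, 1024)).astype(np.float32)
acts[:, [7, 300]] *= 60.0                  # plant two outlier channels
print(find_outlier_channels(acts))         # -> [  7 300]
```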
arXiv Detail & Related papers (2025-08-20T17:59:51Z)
- Phishing Detection in the Gen-AI Era: Quantized LLMs vs Classical Models [1.4999444543328293]
Phishing attacks are becoming increasingly sophisticated, underscoring the need for detection systems that strike a balance between high accuracy and computational efficiency. This paper presents a comparative evaluation of traditional Machine Learning (ML), Deep Learning (DL), and quantized small-parameter Large Language Models (LLMs) for phishing detection. We show that while LLMs currently underperform compared to ML and DL methods in terms of raw accuracy, they exhibit strong potential for identifying subtle, context-based phishing cues.
arXiv Detail & Related papers (2025-07-10T04:01:52Z)
- Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
- Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
Large language models (LLMs) are becoming more capable and widespread. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs.
arXiv Detail & Related papers (2025-02-03T18:59:01Z)
- Learning on Graphs with Large Language Models (LLMs): A Deep Dive into Model Robustness [39.57155321515097]
Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing tasks.
It remains unclear whether LLMs exhibit robustness in learning on graphs.
arXiv Detail & Related papers (2024-07-16T09:05:31Z)
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
- An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning [70.48605869773814]
Catastrophic forgetting (CF) is a phenomenon that occurs in machine learning when a model forgets previously learned information. This study empirically evaluates the forgetting phenomenon in large language models during continual instruction tuning.
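The abstract does not state the measurement protocol; one common way to quantify forgetting is the drop from a task's best accuracy to its accuracy after all later training stages. The sketch below applies that metric to synthetic numbers and is an assumption, not necessarily this paper's setup.

```python
import numpy as np

def forgetting_per_task(acc: np.ndarray) -> np.ndarray:
    """Forgetting for task j = best accuracy ever achieved on j minus final accuracy.

    acc[i, j] = accuracy on task j after finishing training stage i (rows = stages).
    Only tasks trained before the final stage have a meaningful forgetting value.
    """
    best_so_far = acc[:-1].max(axis=0)  # peak accuracy before the last stage
    final = acc[-1]                     # accuracy after all stages
    return best_so_far - final

# Synthetic 3-stage run over 3 tasks (rows: after stage 1, 2, 3).
acc = np.array([
    [0.82, 0.10, 0.05],
    [0.70, 0.85, 0.12],
    [0.61, 0.74, 0.88],
])
print(forgetting_per_task(acc))  # -> [ 0.21  0.11 -0.76]; last entry is not
                                 # meaningful (task 2 only trains in stage 3)
```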
arXiv Detail & Related papers (2023-08-17T02:53:23Z)
- Soft Error Reliability Analysis of Vision Transformers [14.132398744731635]
Vision Transformers (ViTs) that leverage self-attention mechanism have shown superior performance on many classical vision tasks.
Existing ViT work mainly optimizes performance and accuracy, but ViT reliability issues induced by soft errors have generally been overlooked.
In this work, we study the reliability of ViTs and investigate their vulnerability at different architectural granularities.
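The ViT study injects faults at several architectural granularities. As a hypothetical module-level analogue (not the paper's tooling), the PyTorch sketch below attaches forward hooks that flip random bits in the outputs of every Linear submodule; the fault probability, model, and hook placement are illustrative.

```python
import torch
import torch.nn as nn

def make_bitflip_hook(p: float = 1e-4):
    """Forward hook that flips random bits in a module's float32 output.

    p is an illustrative per-element fault probability, not a value from the paper.
    """
    def hook(module, inputs, output):
        bits = output.detach().clone().view(torch.int32)  # reinterpret raw bits
        mask = torch.rand_like(output) < p                # elements hit by a fault
        which_bit = torch.randint(0, 32, output.shape, dtype=torch.int32)
        bits[mask] ^= torch.ones_like(which_bit[mask]) << which_bit[mask]
        return bits.view(torch.float32)                   # replaces the module output
    return hook

# Attach hooks at one granularity: every Linear submodule of a small transformer.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
handles = [m.register_forward_hook(make_bitflip_hook())
           for m in model.modules() if isinstance(m, nn.Linear)]
faulty_out = model(torch.randn(1, 16, 64))  # forward pass with injected faults
for h in handles:
    h.remove()                              # restore the clean model
```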
arXiv Detail & Related papers (2023-02-21T06:17:40Z)
- Degradation Prediction of Semiconductor Lasers using Conditional Variational Autoencoder [0.0]
We propose a new data-driven approach to predict the degradation trend without requiring any specific knowledge or using any physical model.
The proposed approach is based on an unsupervised technique, a conditional variational autoencoder, and is validated using vertical-cavity surface-emitting laser (VCSEL) and tunable edge-emitting laser reliability data.
The experimental results confirm that our model (i) achieves good degradation prediction and generalization performance, yielding an F1 score of 95.3%, (ii) outperforms several baseline ML-based anomaly detection techniques, and (iii) helps shorten aging tests by predicting failed devices early.
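The abstract does not give the architecture details, so the following is a minimal hypothetical PyTorch sketch of the general recipe: a conditional VAE trained on healthy-device telemetry, with reconstruction error serving as the degradation/anomaly score. All dimensions, the conditioning variables, and the training loop are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    """Minimal conditional VAE: reconstruct a telemetry window x conditioned on
    operating conditions c; reconstruction error serves as a degradation score."""

    def __init__(self, x_dim: int = 32, c_dim: int = 2, z_dim: int = 4, h_dim: int = 64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim + c_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim)
        )

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

def elbo_loss(x_hat, x, mu, logvar):
    recon = F.mse_loss(x_hat, x)                                    # reconstruction term
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL regularizer
    return recon + kld

# Train on healthy-device windows only; degrading devices then reconstruct poorly.
model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, c = torch.randn(256, 32), torch.randn(256, 2)  # placeholder telemetry and conditions
for _ in range(100):
    x_hat, mu, logvar = model(x, c)
    loss = elbo_loss(x_hat, x, mu, logvar)
    opt.zero_grad()
    loss.backward()
    opt.step()
with torch.no_grad():
    score = F.mse_loss(model(x, c)[0], x).item()  # higher -> more anomalous / degraded
```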
arXiv Detail & Related papers (2022-11-05T08:10:11Z)