Characterizing GPU Resilience and Impact on AI/HPC Systems
- URL: http://arxiv.org/abs/2503.11901v3
- Date: Sat, 28 Jun 2025 06:07:45 GMT
- Title: Characterizing GPU Resilience and Impact on AI/HPC Systems
- Authors: Shengkun Cui, Archit Patke, Hung Nguyen, Aditya Ranjan, Ziheng Chen, Phuong Cao, Brett Bode, Gregory Bauer, Catello Di Martino, Saurabh Jha, Chandra Narayanaswami, Daby Sow, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
- Abstract summary: This study characterizes GPU resilience in Delta HPC, a large-scale AI system. We used 2.5 years of operational data (11.7 million GPU hours) on GPU errors.
- Score: 5.4879032865205986
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This study characterizes GPU resilience in Delta HPC, a large-scale AI system that consists of 1,056 A100 and H100 GPUs, with over 1,300 petaflops of peak throughput. Delta HPC is operated by the National Center for Supercomputing Applications (NCSA) at the University of Illinois Urbana-Champaign. We used 2.5 years of operational data (11.7 million GPU hours) on GPU errors. Our major findings include: (i) H100 GPU memory resilience is worse than A100 GPU memory resilience, with 3.2x lower per-GPU MTBE for memory errors, (ii) the GPU memory error-recovery mechanisms on H100 GPUs are insufficient to handle the increased memory capacity, (iii) H100 GPUs demonstrate significantly improved GPU hardware resilience over A100 GPUs with respect to critical hardware components, (iv) GPU errors on both A100 and H100 GPUs frequently result in job failures due to the lack of robust recovery mechanisms at the application level, and (v) we project the impact of GPU node availability at larger scales and find that significant overprovisioning of 5% is necessary to handle GPU failures.
Related papers
- GPU in the Blind Spot: Overlooked Security Risks in Transportation [3.3296812191509786]
This paper highlights GPU security as a critical blind spot in transportation cybersecurity. To support this concern, it also presents a case study showing the impact of stealthy unauthorized crypto miners on critical AI workloads.
arXiv Detail & Related papers (2025-08-04T02:25:43Z) - Distributed Equivariant Graph Neural Networks for Large-Scale Electronic Structure Prediction [76.62155593340763]
Equivariant Graph Neural Networks (eGNNs) trained on density-functional theory (DFT) data can potentially perform electronic structure prediction at unprecedented scales. However, the graph representations required for this task tend to be densely connected. We present a distributed eGNN implementation which leverages direct GPU communication and introduce a partitioning strategy of the input graph.
arXiv Detail & Related papers (2025-07-04T23:53:47Z) - GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency [3.1882747895372217]
GPUMC is a stateless model checker that checks the correctness of GPU shared-memory programs under the scoped-RC11 weak memory model. We evaluate GPUMC with benchmarks and real-life GPU programs.
arXiv Detail & Related papers (2025-05-26T16:47:44Z) - HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing [3.50604837678178]
We propose a memory-intensive co-processor that enhances GPU resource utilization during large-batched LLM inference. By offloading memory-bound operations, the HPU allows the GPU to focus on compute-intensive tasks, increasing overall efficiency. Our novel GPU-HPU heterogeneous system demonstrates up to 4.1x performance gains and 4.6x energy efficiency improvements over a GPU-only system.
arXiv Detail & Related papers (2025-04-18T03:31:08Z) - Nonuniform-Tensor-Parallelism: Mitigating GPU failure impact for Scaled-up LLM Training [3.43728657617475]
We propose nonuniform-tensor-parallelism (NTP) to mitigate the amplified impact of GPU failures. We also propose a rack design with improved electrical and thermal capabilities in order to sustain power-boosting of scale-up domains that have experienced failures.
arXiv Detail & Related papers (2025-04-08T14:35:40Z) - HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading [79.38548165722229]
HEADINFER offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU. We demonstrate HEADINFER maintains computational efficiency while significantly reducing memory footprint.
arXiv Detail & Related papers (2025-02-18T06:26:05Z) - Forecasting GPU Performance for Deep Learning Training and Inference [10.741682409837612]
NeuSight is a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution. NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU. It reduces the percentage error from 121.4% and 30.8% to 2.3% in predicting the latency of the GPT3 model for training and inference on H100, compared to state-of-the-art prior work.
arXiv Detail & Related papers (2024-07-18T18:47:52Z) - NeRF-XL: Scaling NeRFs with Multiple GPUs [72.75214892939411]
We present NeRF-XL, a principled method for distributing Neural Radiance Fields (NeRFs) across multiple GPUs.
We show improvements in reconstruction quality with larger parameter counts and speed improvements with more GPUs.
We demonstrate the effectiveness of NeRF-XL on a wide variety of datasets, including the largest open-source dataset to date, MatrixCity, containing 258K images covering a 25 km² city area.
arXiv Detail & Related papers (2024-04-24T21:43:15Z) - Turn Waste into Worth: Rectifying Top-$k$ Router of MoE [111.12838294273033]
MoE models are popular for training large language models due to their computational efficiency.
The commonly used top-$k$ routing mechanism suffers from redundancy and memory costs due to the unbalanced routing.
To address the dropped tokens and padding, we propose the Rectify-Router, comprising the Intra-GPU Rectification and the Fill-in Rectification.
The combination of them achieves superior performance, surpassing the accuracy of the vanilla top-1 router by 4.7%.
arXiv Detail & Related papers (2024-02-17T06:23:27Z) - Whispering Pixels: Exploiting Uninitialized Register Accesses in Modern GPUs [6.1255640691846285]
We showcase the existence of a vulnerability on products of 3 major vendors - Apple, NVIDIA and Qualcomm.
This vulnerability poses unique challenges to an adversary due to opaque scheduling and register remapping algorithms.
We implement information leakage attacks on intermediate data of Convolutional Neural Networks (CNNs) and present the attack's capability to leak and reconstruct the output of Large Language Models (LLMs).
arXiv Detail & Related papers (2024-01-16T23:36:48Z) - WebGPU-SPY: Finding Fingerprints in the Sandbox through GPU Cache Attacks [0.7400926717561453]
We present a new attack vector for microarchitectural attacks in web browsers.
We develop a cache side channel attack on the compute stack of the GPU that spies on victim activities.
We demonstrate that GPU-based cache attacks can achieve a precision of 90% for website fingerprinting of the top 100 websites.
arXiv Detail & Related papers (2024-01-09T04:21:43Z) - FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems [5.572152653851948]
FULL-W2V exploits the opportunities for data reuse in the W2V algorithm to reduce access to low memory levels and improve temporal locality.
Our prototype implementation achieves 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality.
arXiv Detail & Related papers (2023-12-12T21:22:07Z) - Benchmarking GPUs on SVBRDF Extractor Model [0.0]
In this work, we try to differentiate the performance of different GPUs on neural network models that operate on bigger input images (256x256).
arXiv Detail & Related papers (2023-10-19T17:09:06Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient MVA approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z) - An Analysis of Collocation on GPUs for Deep Learning Training [0.0]
Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better-fit workloads.
In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads containing various sizes and combinations of models.
arXiv Detail & Related papers (2022-09-13T14:13:06Z) - EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves the global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Data-Efficient Instance Segmentation with a Single GPU [88.31338435907304]
We introduce a data-efficient segmentation method we used in the 2021 VIPriors Instance Challenge.
Our solution is a modified version of Swin Transformer, built on mmdetection, a powerful toolbox.
Our method achieved an AP@0.50:0.95 (medium) of 0.592, which ranks second among all contestants.
arXiv Detail & Related papers (2021-10-01T07:36:20Z) - Out-of-Core GPU Gradient Boosting [0.0]
We show that much larger datasets can fit on a given GPU, without degrading model accuracy or training time.
This is the first out-of-core GPU implementation of gradient boosting.
arXiv Detail & Related papers (2020-05-19T00:41:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.