A Study of Skews, Imbalances, and Pathological Conditions in LLM Inference Deployment on GPU Clusters detectable from DPU
- URL: http://arxiv.org/abs/2509.18114v1
- Date: Tue, 09 Sep 2025 23:43:05 GMT
- Title: A Study of Skews, Imbalances, and Pathological Conditions in LLM Inference Deployment on GPU Clusters detectable from DPU
- Authors: Javed I. Khan and Henry Uwabor Moye
- Abstract summary: Autoregressive inference in large transformer-based language models (LLMs) presents significant challenges for runtime efficiency. A DPU-assisted framework can enable real-time detection and mitigation of load imbalance in multi-node tensor-parallel inference.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Autoregressive inference in large transformer-based language models (LLMs) presents significant challenges for runtime efficiency, particularly during the decode phase, where load imbalance across GPU shards can cause throughput degradation and latency spikes. A DPU-assisted framework that leverages BlueField-3 Data Processing Units can enable real-time detection and mitigation of load imbalance in multi-node tensor-parallel inference. By offloading monitoring tasks to the DPU and analyzing GPU telemetry and inter-node communication patterns, the resulting system can provide actionable feedback to inference controllers and schedulers. The goal of this study is three-fold: (a) identify the reported skews, imbalances, and pathological conditions that arise in multi-GPU execution of LLM tensor computing (both during training and inference), (b) identify their impact on computational performance, and (c) critically assess whether they can be tracked from a DPU network for potential mitigation.
Related papers
- Plug-and-Play Benchmarking of Reinforcement Learning Algorithms for Large-Scale Flow Control [61.155940786140455]
Reinforcement learning (RL) has shown promising results in active flow control (AFC).
Current AFC benchmarks rely on external computational fluid dynamics (CFD) solvers, are not fully differentiable, and provide limited 3D and multi-agent support.
We introduce FluidGym, the first standalone, fully differentiable benchmark suite for RL in AFC.
arXiv Detail & Related papers (2026-01-21T14:13:44Z)
- Integrated Sensing, Communication, and Computation for Over-the-Air Federated Edge Learning [52.904670248426626]
This paper studies an over-the-air federated edge learning (Air-FEEL) system with integrated sensing, communication, and computation.
We derive a low-complexity ISCC algorithm by alternately optimizing the batch size control and the network resource allocation.
arXiv Detail & Related papers (2025-08-21T02:46:46Z)
- Intra-DP: A High Performance Collaborative Inference System for Mobile Edge Computing [67.98609858326951]
Intra-DP is a high-performance collaborative inference system optimized for deep neural networks (DNNs) on mobile devices.
The evaluation demonstrates that Intra-DP reduces per-inference latency by up to 50% and energy consumption by up to 75% compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-07-08T09:50:57Z)
- eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems [4.745002208778503]
eACGM is a full-stack AI/ML system monitoring framework based on eBPF.
eACGM collects real-time performance data from key hardware components, including the GPU and network communication layer.
arXiv Detail & Related papers (2025-05-25T09:25:39Z)
- Machine Learning for Consistency Violation Faults Analysis [0.0]
This study presents a machine learning-based approach for analyzing the impact of consistency violation faults (cvfs) on distributed systems.
By computing program transition ranks and their corresponding effects, the proposed method quantifies the influence of cvfs on system behavior.
Experimental results demonstrate promising performance, with a test loss of 4.39 and a mean absolute error of 1.5.
arXiv Detail & Related papers (2025-05-20T22:11:43Z)
- The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks [56.37880529653111]
The demand for large AI model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications.
In this paper, we investigate the LAIM-inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment.
arXiv Detail & Related papers (2025-05-14T08:18:55Z)
- Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability [4.054484966653432]
A key measure of machine learning (ML) classification models' safety and reliability is their ability to resist small, targeted input perturbations.
We show that floating-point non-associativity coupled with asynchronous parallel programming on GPU is sufficient to result in misclassification.
We also show that standard adversarial robustness results may be overestimated up to 4.6 when not considering machine-level details.
arXiv Detail & Related papers (2025-03-21T14:19:45Z)
- Understanding Silent Data Corruption in LLM Training [22.679273469491754]
We investigate the impact of silent data corruption (SDC) on large language model training by comparing model training between healthy production nodes and unhealthy nodes exhibiting SDCs.
Our results reveal that the impact of SDCs on computation varies across different unhealthy nodes.
arXiv Detail & Related papers (2025-02-17T22:07:49Z)
- Impacts of floating-point non-associativity on reproducibility for HPC and deep learning applications [0.0]
Run-to-run variability in parallel programs caused by floating-point non-associativity has been known to significantly affect algorithms.
We investigate the statistical properties of floating-point non-associativity within modern parallel programming models.
We examine the recently-added deterministic options in PyTorch within the context of GPU deployment for deep learning.
arXiv Detail & Related papers (2024-08-09T16:07:37Z)
- DA-Flow: Dual Attention Normalizing Flow for Skeleton-based Video Anomaly Detection [52.74152717667157]
We propose a lightweight module called Dual Attention Module (DAM) for capturing cross-dimension interaction relationships in spatio-temporal skeletal data.
It employs the frame attention mechanism to identify the most significant frames and the skeleton attention mechanism to capture broader relationships across fixed partitions with minimal parameters and flops.
arXiv Detail & Related papers (2024-06-05T06:18:03Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- Convolutional generative adversarial imputation networks for spatio-temporal missing data in storm surge simulations [86.5302150777089]
Generative Adversarial Imputation Nets (GAIN) and GAN-based techniques have attracted attention as unsupervised machine learning methods.
We name our proposed method Convolutional Generative Adversarial Imputation Nets (Conv-GAIN).
arXiv Detail & Related papers (2021-11-03T03:50:48Z)