Understanding Silent Data Corruption in LLM Training
- URL: http://arxiv.org/abs/2502.12340v1
- Date: Mon, 17 Feb 2025 22:07:49 GMT
- Title: Understanding Silent Data Corruption in LLM Training
- Authors: Jeffrey Ma, Hengzhi Pei, Leonard Lausen, George Karypis,
- Abstract summary: We investigate the impact of silent data corruption (SDC) on large language model (LLM) training by comparing model training between healthy production nodes and unhealthy nodes exhibiting SDCs. Our results reveal that the impact of SDCs on computation varies across different unhealthy nodes.
- Score: 22.679273469491754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the scale of training large language models (LLMs) increases, one emergent failure is silent data corruption (SDC), where hardware produces incorrect computations without explicit failure signals. In this work, we are the first to investigate the impact of real-world SDCs on LLM training by comparing model training between healthy production nodes and unhealthy nodes exhibiting SDCs. With help from a cloud computing platform, we access unhealthy nodes that were swept out of production by automated fleet management. Using deterministic execution via the XLA compiler and our proposed synchronization mechanisms, we isolate and analyze the impact of SDC errors on these nodes at three levels: at each submodule computation, at a single optimizer step, and over a training period. Our results reveal that the impact of SDCs on computation varies across different unhealthy nodes. Although in most cases the perturbations from SDCs on submodule computations and gradients are relatively small, SDCs can lead models to converge to different optima with different weights and can even cause spikes in the training loss. Our analysis sheds light on further understanding and mitigating the impact of SDCs.
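The comparison idea in the abstract (run the same deterministic computation on a healthy and a suspect node, then diff the outputs of each submodule) can be illustrated with a minimal sketch. This is not the paper's actual XLA setup; the toy forward pass, the checksum fingerprints, and all function names here are illustrative assumptions.

```python
import numpy as np

def submodule_fingerprints(weights, inputs):
    """Run a toy deterministic forward pass and record a checksum of each
    submodule's output. Stands in for one bitwise-deterministic training step."""
    fingerprints = []
    x = inputs
    for w in weights:
        x = np.tanh(x @ w)  # toy submodule computation
        fingerprints.append(np.float64(x.sum()))
    return fingerprints

def compare_nodes(healthy, suspect, atol=0.0):
    """With bitwise-deterministic execution, any mismatch between the two
    runs points at silent data corruption on the suspect node."""
    return [i for i, (a, b) in enumerate(zip(healthy, suspect))
            if abs(a - b) > atol]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)) for _ in range(3)]
inputs = rng.standard_normal((4, 8))

fp_healthy = submodule_fingerprints(weights, inputs)
fp_suspect = submodule_fingerprints(weights, inputs)
fp_suspect[1] += 1e-3  # inject a small corruption into submodule 1's record

print(compare_nodes(fp_healthy, fp_suspect))  # -> [1]
```

The key prerequisite, as in the paper, is determinism: without it, benign run-to-run noise would be indistinguishable from corruption.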
Related papers
- An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence [29.17303563861459]
We show that noisy data indeed induces training loss divergence. We also find that noise-induced divergences exhibit activation patterns distinct from those caused by high learning rates.
arXiv Detail & Related papers (2026-02-02T17:58:50Z) - Exploring Structural Degradation in Dense Representations for Self-supervised Learning [84.52554180480037]
We observe a counterintuitive phenomenon in self-supervised learning (SSL): longer training may impair the performance of dense prediction tasks. We refer to this phenomenon as Self-supervised Dense Degradation (SDD) and demonstrate its consistent presence across sixteen state-of-the-art SSL methods. We introduce a Dense representation Structure Estimator (DSE) composed of a class-relevance measure and an effective dimensionality measure.
arXiv Detail & Related papers (2025-10-20T08:40:16Z) - A Study of Skews, Imbalances, and Pathological Conditions in LLM Inference Deployment on GPU Clusters detectable from DPU [0.0]
Autoregressive inference in large transformer-based language models (LLMs) presents significant challenges for runtime efficiency. A DPU-assisted framework can enable real-time detection and mitigation of load imbalance in multi-node tensor-parallel inference.
arXiv Detail & Related papers (2025-09-09T23:43:05Z) - Machine Learning for Consistency Violation Faults Analysis [0.0]
This study presents a machine learning-based approach for analyzing the impact of consistency violation faults (CVFs) on distributed systems. By computing program transition ranks and their corresponding effects, the proposed method quantifies the influence of CVFs on system behavior. Experimental results demonstrate promising performance, with a test loss of 4.39 and a mean absolute error of 1.5.
arXiv Detail & Related papers (2025-05-20T22:11:43Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models.
We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - Unbiased and Sign Compression in Distributed Learning: Comparing Noise Resilience via SDEs [2.218667838700643]
Distributed methods are essential for handling machine learning pipelines comprising large-scale models and datasets.
Their robustness to large and heavy-tailed gradient noise, a phenomenon sometimes observed in language modeling, remains poorly understood.
This work addresses this gap by analyzing Distributed Compressed SGD (DCSGD) and Distributed SignSGD (DSignSGD) using differential equations.
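The distinction this entry draws between sign compression and unbiased compression can be sketched concretely. The two toy compressors below (sign quantization vs. unbiased rand-k sparsification) are standard illustrations of the idea, not the paper's exact operators; function names are illustrative assumptions.

```python
import numpy as np

def sign_compress(grad):
    """SignSGD-style compression: keep only the sign of each coordinate
    (biased, but with bounded per-coordinate magnitude)."""
    return np.sign(grad)

def unbiased_rand_compress(grad, k, rng):
    """Rand-k sparsification: keep k random coordinates, rescaled by d/k so
    the compressor is unbiased (E[C(g)] = g)."""
    d = grad.size
    mask = np.zeros(d)
    idx = rng.choice(d, size=k, replace=False)
    mask[idx] = d / k
    return grad * mask

rng = np.random.default_rng(0)
g = np.array([0.5, -2.0, 0.1, 3.0])

print(sign_compress(g))  # -> [ 1. -1.  1.  1.]

# Averaging many rand-k samples recovers g, illustrating unbiasedness:
est = np.mean([unbiased_rand_compress(g, 2, rng) for _ in range(20000)], axis=0)
print(np.round(est, 1))
```

The sign compressor discards magnitudes entirely, which is precisely why its noise resilience differs from the unbiased family the paper analyzes via SDEs.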
arXiv Detail & Related papers (2025-02-24T09:39:17Z) - Scaling Sparse and Dense Retrieval in Decoder-Only LLMs [20.173669986209024]
Scaling large language models (LLMs) has shown great potential for improving retrieval model performance.
Previous studies have mainly focused on dense retrieval trained with contrastive loss (CL).
Sparse retrieval models consistently outperform dense retrieval across both in-domain (MSMARCO, TREC DL) and out-of-domain (BEIR) benchmarks.
arXiv Detail & Related papers (2025-02-21T15:28:26Z) - OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance [65.48009829137824]
Large-scale 3D parallel training on vision-language instruct-tuning models leads to an imbalanced computation load across different devices. We rebalanced the computational loads from data, model, and memory perspectives to address this issue. Our method's efficacy and generalizability were further demonstrated across various models and datasets.
arXiv Detail & Related papers (2024-07-30T12:02:58Z) - Revisiting the Disequilibrium Issues in Tackling Heart Disease Classification Tasks [5.834731599084117]
Two primary obstacles arise in the field of heart disease classification.
Electrocardiogram (ECG) datasets consistently demonstrate imbalances and biases across various modalities.
We propose a Channel-wise Magnitude Equalizer (CME) on signal-encoded images.
We also propose the Inverted Weight Logarithmic Loss (IWL) to alleviate imbalances among the data.
arXiv Detail & Related papers (2024-07-19T09:50:49Z) - DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models [3.3484462092188005]
We introduce a lazy asynchronous multi-level approach that takes advantage of the fact that the tensors making up the model and state shards remain immutable for extended periods of time.
The results show up to 48x faster checkpointing and 2.2x faster end-to-end training compared with state-of-the-art checkpointing approaches.
arXiv Detail & Related papers (2024-06-15T18:30:40Z) - On Improving the Algorithm-, Model-, and Data- Efficiency of Self-Supervised Learning [18.318758111829386]
We propose an efficient single-branch SSL method based on non-parametric instance discrimination.
We also propose a novel self-distillation loss that minimizes the KL divergence between the probability distribution and its square root version.
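The self-distillation loss described here, a KL divergence between a probability distribution and its square-root version, can be written out in a few lines. This is a schematic reading of the abstract's wording, not the paper's verified implementation; the renormalization step and function names are assumptions.

```python
import numpy as np

def sqrt_version(p, eps=1e-12):
    """'Square-root version' of a distribution: element-wise square root,
    renormalized to sum to 1 (a flattened copy of p)."""
    q = np.sqrt(p + eps)
    return q / q.sum()

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) for discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

p = np.array([0.7, 0.2, 0.1])
loss = kl(p, sqrt_version(p))
print(round(loss, 4))  # small positive value; zero only when p is uniform
```

Intuitively, the square-root version is a softened target, so minimizing this divergence pulls the model's distribution toward a less peaked copy of itself.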
arXiv Detail & Related papers (2024-04-30T06:39:04Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - Reducing self-supervised learning complexity improves weakly-supervised classification performance in computational pathology [0.0]
Self-supervised learning (SSL) methods allow for large-scale analyses on non-annotated data.
We investigated the complexity of SSL in relation to classification performance with the utilization of consumer-grade hardware.
Our experiments demonstrate that we can improve downstream classification performance whilst reducing SSL training duration by 90%.
arXiv Detail & Related papers (2024-03-07T14:56:06Z) - Prompt Perturbation Consistency Learning for Robust Language Models [47.021022978847036]
Large language models (LLMs) have demonstrated impressive performance on a number of natural language processing tasks.
We show that fine-tuning sufficiently large LLMs can produce IC-SF performance comparable to discriminative models.
We propose an efficient mitigation approach, Prompt Perturbation Consistency Learning (PPCL), which works by regularizing the divergence between losses from clean and perturbed samples.
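The regularization idea behind PPCL, penalizing divergence between predictions on clean and perturbed inputs alongside the task loss, can be sketched as follows. This is a generic consistency-regularization skeleton consistent with the abstract's description, not the paper's exact objective; the weighting `alpha` and all names are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def ppcl_style_loss(logits_clean, logits_pert, labels, alpha=0.5):
    """Task cross-entropy on clean inputs plus a consistency term that
    penalizes divergence between clean and perturbed predictions."""
    p_clean = softmax(logits_clean)
    p_pert = softmax(logits_pert)
    ce = -np.log(p_clean[np.arange(len(labels)), labels] + 1e-12)
    consistency = kl(p_clean, p_pert)
    return float(np.mean(ce + alpha * consistency))

logits = np.array([[2.0, 0.0, 0.0]])
labels = np.array([0])
# The consistency term vanishes when perturbation leaves predictions unchanged:
print(ppcl_style_loss(logits, logits, labels))
print(ppcl_style_loss(logits, logits + np.array([[0.0, 1.5, 0.0]]), labels))
```

The second call returns a strictly larger loss, since the perturbed logits shift the predicted distribution and the KL term becomes positive.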
arXiv Detail & Related papers (2024-02-24T15:00:58Z) - A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z) - Improving GANs with A Dynamic Discriminator [106.54552336711997]
We argue that a discriminator with an on-the-fly adjustment on its capacity can better accommodate such a time-varying task.
A comprehensive empirical study confirms that the proposed training strategy, termed DynamicD, improves the synthesis performance without incurring any additional cost or training objectives.
arXiv Detail & Related papers (2022-09-20T17:57:33Z) - DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations.
DaSGD parallelizes SGD and forward/back propagations to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z) - How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.