An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence
- URL: http://arxiv.org/abs/2602.02400v1
- Date: Mon, 02 Feb 2026 17:58:50 GMT
- Title: An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence
- Authors: Qizhen Zhang, Ankush Garg, Jakob Foerster, Niladri Chatterji, Kshitiz Malik, Mike Lewis,
- Abstract summary: We show that noisy data indeed induces training loss divergence. We also find that noise-induced divergences exhibit activation patterns distinct from those caused by high learning rates.
- Score: 29.17303563861459
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale pretraining datasets drive the success of large language models (LLMs). However, these web-scale corpora inevitably contain large amounts of noisy data due to unregulated web content or randomness inherent in data. Although LLM pretrainers often speculate that such noise contributes to instabilities in large-scale LLM pretraining and, in the worst cases, loss divergence, this phenomenon remains poorly understood. In this work, we present a systematic empirical study of whether noisy data causes LLM pretraining divergences and how it does so. By injecting controlled synthetic uniformly random noise into otherwise clean datasets, we analyze training dynamics across model sizes ranging from 480M to 5.2B parameters. We show that noisy data indeed induces training loss divergence, and that the probability of divergence depends strongly on the noise type, amount of noise, and model scale. We further find that noise-induced divergences exhibit activation patterns distinct from those caused by high learning rates, and we provide diagnostics that differentiate these two failure modes. Together, these results provide a large-scale, controlled characterization of how noisy data affects loss divergence in LLM pretraining.
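The abstract's core experimental manipulation is injecting controlled, uniformly random synthetic noise into otherwise clean pretraining data. The paper's exact corruption procedure is not reproduced here, so the following is only a minimal sketch of one plausible implementation: the function name, the document-level corruption granularity, and the parameter values are all illustrative assumptions, not the authors' code.

```python
import numpy as np

def inject_uniform_noise(token_ids: np.ndarray,
                         vocab_size: int,
                         noise_fraction: float,
                         seed: int = 0) -> np.ndarray:
    """Hypothetical helper: overwrite a fraction of documents in a tokenized
    batch with i.i.d. uniformly random token ids (one possible reading of
    'synthetic uniformly random noise')."""
    rng = np.random.default_rng(seed)
    corrupted = token_ids.copy()
    n_docs, seq_len = token_ids.shape
    # Choose which documents to corrupt; the fraction is the experimental knob.
    n_noisy = int(round(noise_fraction * n_docs))
    noisy_idx = rng.choice(n_docs, size=n_noisy, replace=False)
    # Replace the selected documents with uniform draws over the vocabulary.
    corrupted[noisy_idx] = rng.integers(0, vocab_size, size=(n_noisy, seq_len))
    return corrupted

# Toy usage: corrupt 1% of 1,000 documents of length 2,048 over a 32k vocabulary.
clean = np.zeros((1000, 2048), dtype=np.int64)
noisy = inject_uniform_noise(clean, vocab_size=32_000, noise_fraction=0.01)
```

Under a setup like this, sweeping noise_fraction (and the noise type) across runs and model sizes would mirror the controlled comparison of noise amount, noise type, and model scale described in the abstract.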
Related papers
- Noisy Analysis of Quantum SMOTE on Condition Monitoring and Fault Classification in Industrial and Energy Systems [0.5505634045241289]
Class imbalance is a fundamental issue in industrial condition monitoring and fault classification pipelines. This work presents a detailed benchmarking and investigation of classical classifiers under class-imbalance mitigation. The results show that QSMOTE consistently corrects distributional skew and significantly enhances the performance of non-linear classifiers.
arXiv Detail & Related papers (2026-01-16T16:44:38Z) - Scaling Behavior of Discrete Diffusion Language Models [74.72926629897636]
We study the scaling behavior of discrete diffusion language models (DLMs) on different noise types. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from that of autoregressive language models (ALMs). We scale our uniform diffusion model up to 10B parameters trained for $10^{22}$ FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
arXiv Detail & Related papers (2025-12-11T17:54:10Z) - MANTRA: a Framework for Multi-stage Adaptive Noise TReAtment During Training [3.619444603816032]
Large-scale repositories introduce noisy or mislabeled examples that degrade both accuracy and robustness. We propose MANTRA, a Multi-stage Adaptive Noise TReAtment framework that embeds noise diagnosis and mitigation directly into the fine-tuning process.
arXiv Detail & Related papers (2025-12-03T23:09:55Z) - On the Collapse Errors Induced by the Deterministic Sampler for Diffusion Models [38.99546114710447]
Collapse errors are a previously unrecognized phenomenon in ODE-based diffusion sampling. We observe a see-saw effect, where score learning in low-noise regimes adversely impacts learning in high-noise regimes. This misfitting in high-noise regimes, coupled with the dynamics of deterministic samplers, ultimately causes collapse errors.
arXiv Detail & Related papers (2025-08-22T07:26:24Z) - Impact of Noisy Supervision in Foundation Model Learning [91.56591923244943]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets. We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z) - Understanding and Mitigating the Label Noise in Pre-training on
Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z) - An Investigation of Noise in Morphological Inflection [21.411766936034]
We investigate the types of noise encountered within a pipeline for truly unsupervised morphological paradigm completion.
We compare the effect of different types of noise on multiple state-of-the-art inflection models.
We propose a novel character-level masked language modeling (CMLM) pretraining objective and explore its impact on the models' resistance to noise.
arXiv Detail & Related papers (2023-05-26T02:14:34Z) - Improving the Robustness of Summarization Models by Detecting and
Removing Input Noise [50.27105057899601]
We present a large empirical study quantifying the sometimes severe loss in performance from different types of input noise for a range of datasets and model sizes.
We propose a light-weight method for detecting and removing such noise in the input during model inference without requiring any training, auxiliary models, or even prior knowledge of the type of noise.
arXiv Detail & Related papers (2022-12-20T00:33:11Z) - The Optimal Noise in Noise-Contrastive Learning Is Not What You Think [80.07065346699005]
We show that deviating from the common assumption that the noise distribution should match the data distribution can actually lead to better statistical estimators.
In particular, the optimal noise distribution is different from the data's and even from a different family.
arXiv Detail & Related papers (2022-03-02T13:59:20Z) - Bridging the Gap Between Clean Data Training and Real-World Inference
for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean-data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms baseline models on a real-world (noisy) corpus but also enhances robustness, producing high-quality results in noisy environments.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)