Fugu-MT Paper Translation (Abstract): An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence

Paper overview: An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence

arxiv url: http://arxiv.org/abs/2602.02400v1
Date: Mon, 02 Feb 2026 17:58:50 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:34.341478
Title: An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence
Authors: Qizhen Zhang, Ankush Garg, Jakob Foerster, Niladri Chatterji, Kshitiz Malik, Mike Lewis
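Abstract summary: We show that noisy data does induce training loss divergence. We also find that noise-induced divergences exhibit activation patterns distinct from those caused by high learning rates.
Reference score (independently computed attention metric): 29.17303563861459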
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large-scale pretraining datasets drive the success of large language models (LLMs). However, these web-scale corpora inevitably contain large amounts of noisy data due to unregulated web content or randomness inherent in data. Although LLM pretrainers often speculate that such noise contributes to instabilities in large-scale LLM pretraining and, in the worst cases, loss divergence, this phenomenon remains poorly understood. In this work, we present a systematic empirical study of whether noisy data causes LLM pretraining divergences and how it does so. By injecting controlled synthetic uniformly random noise into otherwise clean datasets, we analyze training dynamics across model sizes ranging from 480M to 5.2B parameters. We show that noisy data indeed induces training loss divergence, and that the probability of divergence depends strongly on the noise type, amount of noise, and model scale. We further find that noise-induced divergences exhibit activation patterns distinct from those caused by high learning rates, and we provide diagnostics that differentiate these two failure modes. Together, these results provide a large-scale, controlled characterization of how noisy data affects loss divergence in LLM pretraining.


Related papers