Fugu-MT Paper Translation (Abstract): Scale Dependent Data Duplication

Paper overview: Scale Dependent Data Duplication

arxiv url: http://arxiv.org/abs/2603.06603v1
Date: Wed, 18 Feb 2026 05:22:58 GMT
Status: Translation complete
Last updated in system: 2026-03-15 16:38:22.41933
Title: Scale Dependent Data Duplication
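Title (reference Japanese translation): スケール依存型データ重複
Authors: Joshua Kazdan, Noam Levi, Rylan Schaeffer, Jessica Chudnovsky, Abhay Puri, Bo He, Mehmet Donmez, Sanmi Koyejo, David Donoho
Abstract summary: Semantic duplicates operate increasingly like exact duplicates during training. The authors embedded all 192 million FineWeb-Edu-Dedup documents using EmbeddingGemma-300m. They derive explicit scaling laws that allow practitioners to estimate the deviation from expected scaling caused by limited semantic uniqueness of the pretraining corpus.
Reference score (independently computed attention metric): 29.59812821602787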
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data duplication during pretraining can degrade generalization and lead to memorization, motivating aggressive deduplication pipelines. However, at web scale, it is unclear what constitutes a "duplicate": beyond surface-form matches, semantically equivalent documents (e.g., translations) may induce redundant training signals once models become sufficiently capable. Practically, this means that semantic duplicates operate increasingly like exact duplicates during training. We present evidence that duplication is scale-dependent in two ways. First, as model capability increases, cross-entropy loss gradients for semantically equivalent documents become more aligned. Smaller models, by contrast, produce gradients that reflect surface similarity (e.g., shared tokens) rather than semantic similarity. Second, we embedded all 192 million FineWeb-Edu-Dedup documents using EmbeddingGemma-300m. For moderate corpus sizes, the cosine similarity between nearest neighbors follows an isotropic power law baseline. However, as corpus size grows to hundreds of billions of tokens, the nearest-neighbor similarities deviate sharply, indicating accelerated semantic collisions. Finally, controlled pretraining on data sampled with replacement from pools of finite unique documents shows that limited uniqueness yields mild degradation for small models, but rapidly increasing loss penalties for larger models, breaking naive scaling extrapolation. We derive explicit scaling laws that allow practitioners to estimate deviation from expected scaling due to limited semantic uniqueness of the pretraining corpus. Our results identify and resolve an unstudied source of scale-dependence, allowing for more accurate prediction at scale.
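Abstract (reference translation): Duplicated data during pretraining can degrade generalization and lead to memorization, which motivates aggressive deduplication pipelines. At web scale, however, it is unclear what counts as a "duplicate": beyond surface-form matches, semantically equivalent documents (e.g., translations) may induce redundant training signals once models are sufficiently capable. In practice, this means that semantic duplicates behave increasingly like exact duplicates during training. The paper shows that duplication is scale-dependent in two ways. First, as model capability increases, the cross-entropy loss gradients for semantically equivalent documents become more aligned; smaller models, by contrast, produce gradients that reflect surface similarity (e.g., shared tokens) rather than semantic similarity. Second, the authors embedded all 192 million FineWeb-Edu-Dedup documents using EmbeddingGemma-300m. For moderate corpus sizes, the cosine similarity between nearest neighbors follows an isotropic power law baseline. As the corpus grows to hundreds of billions of tokens, however, the nearest-neighbor similarities deviate sharply from that baseline, indicating accelerated semantic collisions. Finally, controlled pretraining on data sampled with replacement from pools of a finite number of unique documents shows that limited uniqueness causes only mild degradation for small models but rapidly increasing loss penalties for larger models, which breaks naive scaling extrapolation. The authors derive explicit scaling laws that allow practitioners to estimate the deviation from expected scaling caused by limited semantic uniqueness of the pretraining corpus. The results identify and resolve a previously unstudied source of scale-dependence, enabling more accurate prediction at scale.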
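The "sampled with replacement from pools of finite unique documents" setup mentioned in the abstract can be illustrated with a small sketch. This is not the paper's scaling law or its experimental code. It only uses the standard identity that uniform sampling with replacement from a pool of U items yields U(1 - (1 - 1/U)^N) distinct items in expectation after N draws. The function names and the pool size below are hypothetical choices made for illustration.

```python
import random


def expected_unique(pool_size: int, draws: int) -> float:
    """Expected number of distinct documents seen after `draws` samples
    taken uniformly with replacement from `pool_size` unique documents."""
    return pool_size * (1.0 - (1.0 - 1.0 / pool_size) ** draws)


def simulated_unique(pool_size: int, draws: int, seed: int = 0) -> int:
    """Monte Carlo check: count distinct document ids in one sampled stream."""
    rng = random.Random(seed)
    return len({rng.randrange(pool_size) for _ in range(draws)})


if __name__ == "__main__":
    pool_size = 10_000  # hypothetical number of unique documents in the pool
    for draws in (1_000, 10_000, 100_000):
        exp = expected_unique(pool_size, draws)
        sim = simulated_unique(pool_size, draws)
        # Fraction of draws that, in expectation, repeat an already-seen document.
        repeat_fraction = 1.0 - exp / draws
        print(f"draws={draws:>7}  expected unique={exp:9.1f}  "
              f"simulated unique={sim:6d}  repeated fraction={repeat_fraction:.3f}")
```

Once the number of draws is comparable to the pool size, a large share of the training stream consists of repeats. How those repeats translate into loss penalties at different model sizes is what the paper's scaling laws address, and this sketch does not attempt to model that.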

Related papers