Fugu-MT Paper Translation (Abstract): Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

Paper abstract: Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

arxiv url: http://arxiv.org/abs/2508.15096v1
Date: Wed, 20 Aug 2025 22:16:57 GMT
Status: Translation complete
Last updated in system: 2025-08-22 16:26:46.10964
Title: Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset
Title (reference translation): Nemotron-CC-Math: A 133 Billion-Token-Scale High-Quality Math Pretraining Dataset
Authors: Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
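Abstract summary: We introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl. Our pipeline recovers math by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines.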
Reference score (independently computed attention metric): 38.74581584840398
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collected a large, high-quality math corpus, namely Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5 times more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.
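Abstract (reference translation): Pretraining large language models (LLMs) on structured data such as mathematics and code greatly improves reasoning ability. However, existing math-focused datasets built from Common Crawl suffer from degraded quality caused by brittle extraction heuristics, lossy HTML-to-text conversion, and an inability to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus built from Common Crawl with a novel, domain-agnostic pipeline designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math in various formats (MathJax, KaTeX, MathML, and others) by using layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach maintains the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX, and correcting inconsistencies. We collected a large, high-quality math corpus in two versions: Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). In particular, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5 times more tokens than FineMath-4+, previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, the corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, and also improves general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, producing measurable gains in math, code, and general reasoning and establishing a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.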
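The abstract mentions layout-aware rendering with lynx as the first step of the pipeline. A minimal sketch of what such a rendering step could look like is given below. It is an illustration only, not the authors' implementation: the function names `build_lynx_command` and `render_html` are hypothetical, and the choice of flags (`-stdin`, `-force_html`, `-dump`, `-nolist`, `-width`) and the default width of 200 columns are assumptions, since the abstract does not state how lynx is invoked.

```python
from __future__ import annotations

import shutil
import subprocess


def build_lynx_command(width: int = 200) -> list[str]:
    """Build a lynx invocation that reads HTML on stdin and dumps rendered text.

    Hypothetical helper: the flag selection is an assumption, not taken from the paper.
    """
    return [
        "lynx",
        "-stdin",           # read the document from standard input
        "-force_html",      # treat the input as HTML regardless of content sniffing
        "-dump",            # write the rendered page to stdout and exit
        "-nolist",          # omit the trailing list of link references
        f"-width={width}",  # wide output so long lines are wrapped less often
    ]


def render_html(html: str, width: int = 200) -> str:
    """Render an HTML string to layout-preserving plain text using lynx.

    Raises RuntimeError if the lynx binary is not available on PATH.
    """
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx is not installed or not on PATH")
    result = subprocess.run(
        build_lynx_command(width),
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Calling `render_html("<p>Let x = 1.</p>")` on a machine with lynx installed returns the page as plain text. The LLM-based cleaning stage described in the abstract would then operate on that text; it is not sketched here because the abstract gives no details about it.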

Related papers