Fugu-MT Paper Translation (Abstract): AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

Paper overview: AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

arxiv url: http://arxiv.org/abs/2511.16397v1
Date: Thu, 20 Nov 2025 14:15:23 GMT
Status: Translation complete
Last updated in system: 2025-11-21 17:08:52.667665
Title: AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
Title (reference translation): AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
Authors: Ren Ma, Jiantao Qiu, Chao Xu, Pei Chu, Kaiwen Liu, Pengli Ren, Yuan Qu, Jiahui Peng, Linfeng Hou, Mengjie Liu, Lindong Lu, Wenchang Ning, Jia Yu, Rui Min, Jin Shi, Haojiong Chen, Peng Zhang, Wenjian Zhang, Qian Jiang, Zengjie Hu, Guoqiang Yang, Zhenxiang Li, Fukai Shang, Zhongying Tu, Wentao Zhang, Dahua Lin, Conghui He
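Abstract summary: We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem. On MainWebBench, a benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%.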
Reference score (independently computed attention score): 55.626118388697456
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion-token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
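Abstract (reference translation): Web data quality is essential for large language models, yet most curation efforts focus on filtering and deduplication and treat HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors such as Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can affect downstream performance as much as aggressive filtering strategies do. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and uses a two-stage formatting pipeline that explicitly categorizes semantic elements before converting them to Markdown. Importantly, the model-based approach is inherently scalable, whereas heuristic methods offer only limited paths to improvement. On MainWebBench, a benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%. Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion-token multilingual corpus built from two Common Crawl snapshots. In controlled pretraining experiments where AICC and the Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical and often underestimated component of web corpus construction.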
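
The abstract reports extraction quality as ROUGE-N F1 between extracted text and annotated reference text. As a rough illustration of what that metric measures, here is a minimal Python sketch. The whitespace tokenization, the default n=2, and the function names are assumptions made for this example; the abstract does not state the n or the tokenizer used for MainWebBench.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=2):
    """ROUGE-N F1 between two strings.

    Assumption: tokens are whitespace-separated; a real evaluation
    would likely use a language-aware tokenizer.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    extracted = "the cat sat on the mat"
    annotated = "the cat sat on a mat"
    print(round(rouge_n_f1(extracted, annotated), 3))
```

An extractor that drops or mangles main content lowers recall, while one that lets navigation or advertising text through lowers precision; the F1 score penalizes both.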
