AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
- URL: http://arxiv.org/abs/2511.16397v2
- Date: Wed, 26 Nov 2025 12:28:02 GMT
- Title: AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
- Authors: Ren Ma, Jiantao Qiu, Chao Xu, Pei Chu, Kaiwen Liu, Pengli Ren, Yuan Qu, Jiahui Peng, Linfeng Hou, Mengjie Liu, Lindong Lu, Wenchang Ning, Jia Yu, Rui Min, Jin Shi, Haojiong Chen, Peng Zhang, Wenjian Zhang, Qian Jiang, Zengjie Hu, Guoqiang Yang, Zhenxiang Li, Fukai Shang, Runyuan Ma, Chenlin Su, Zhongying Tu, Wentao Zhang, Dahua Lin, Conghui He
- Abstract summary: We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation.
- Score: 54.623900859999424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08 pp, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
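The abstract frames extraction as sequence labeling over an HTML page: segment the page into candidate blocks, label each block as main content or boilerplate, and keep only the main-content blocks. As a rough, hypothetical sketch of that formulation (the block tags, labels, and trivial stand-in classifier below are invented for illustration; the actual MinerU-HTML uses a 0.6B-parameter language model for the labeling step):

```python
from html.parser import HTMLParser

# Hypothetical sketch of the sequence-labeling formulation: segment an HTML
# page into candidate blocks, assign each block a MAIN/BOILERPLATE label,
# and keep only MAIN blocks. A trivial tag-based rule stands in for the
# paper's 0.6B-parameter labeling model; only the interface is illustrated.

class BlockSegmenter(HTMLParser):
    """Collect text blocks together with the tag they appeared under."""
    BLOCK_TAGS = {"p", "pre", "h1", "h2", "li", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.stack = []   # open block tags
        self.blocks = []  # list of (tag, text)

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.blocks.append((self.stack[-1], text))

def label_block(tag, text):
    # Stand-in for the model: treat navigation/footer blocks as boilerplate.
    return "BOILERPLATE" if tag in {"nav", "footer"} else "MAIN"

def extract_main(html):
    seg = BlockSegmenter()
    seg.feed(html)
    return [text for tag, text in seg.blocks
            if label_block(tag, text) == "MAIN"]

page = ("<nav>Home | About</nav>"
        "<h1>Title</h1><p>Body text.</p>"
        "<footer>Copyright</footer>")
print(extract_main(page))  # ['Title', 'Body text.']
```

The point of the formulation is that the per-block decision can be made by any classifier; swapping the rule above for a learned model changes the labeling quality but not the pipeline shape.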
Related papers
- Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining [78.36592534300839]
We show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance. This suggests a simple intervention: by taking a union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71%.
arXiv Detail & Related papers (2026-02-23T06:41:57Z)
- UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters [55.34921520578968]
Vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. We propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents.
arXiv Detail & Related papers (2025-12-24T10:35:21Z)
- Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM [35.10225876152952]
We present Dripper, an efficient HTML main content extraction framework powered by lightweight language models. We introduce a controlled decoding mechanism that strictly constrains the output space through logits processors. Dripper achieves state-of-the-art performance across all evaluation benchmarks and outperforms all baseline methods.
arXiv Detail & Related papers (2025-11-28T12:04:46Z)
- Semantic Outlier Removal with Embedding Models and LLMs [0.45080838507508303]
We introduce SORE (Semantic Outlier Removal), a cost-effective, transparent method to identify and excise unwanted text segments. SORE achieves near-LLM extraction precision at a fraction of the cost. Our system is currently deployed in production, processing millions of documents daily across multiple languages.
arXiv Detail & Related papers (2025-06-19T23:06:12Z)
- NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction [6.09502686736443]
We introduce a concrete evaluation framework for web data record extraction. Our framework generates evaluation snapshots, annotates supervision labels, and employs structure-aware metrics for consistent scoring. It also incorporates preprocessing to optimize input for Large Language Model (LLM)-based approaches.
arXiv Detail & Related papers (2025-05-21T21:03:37Z)
- ReaderLM-v2: Small Language Model for HTML to Markdown and JSON [7.9969849952515775]
We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents of up to 512K tokens, converting messy HTML into clean Markdown or JSON formats with high accuracy, making it an ideal tool for grounding large language models.
arXiv Detail & Related papers (2025-03-03T03:57:04Z)
- Tensor Product Attention Is All You Need [61.3442269053374]
Tensor Product Attention (TPA) is a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly. TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling.
arXiv Detail & Related papers (2025-01-11T03:37:10Z)
- HAND: Hierarchical Attention Network for Multi-Scale Handwritten Document Recognition and Layout Analysis [21.25786478579275]
Handwritten document recognition is one of the most challenging tasks in computer vision. Traditionally, this problem has been approached as two separate tasks, handwritten text recognition and layout analysis. This paper introduces HAND, a novel end-to-end and segmentation-free architecture for simultaneous text recognition and layout analysis tasks.
arXiv Detail & Related papers (2024-12-25T20:36:29Z)
- xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token [108.7069350303884]
xRAG is an innovative context compression method tailored for retrieval-augmented generation. xRAG seamlessly integrates document embeddings into the language model representation space. Experimental results demonstrate that xRAG achieves an average improvement of over 10% across six knowledge-intensive tasks.
arXiv Detail & Related papers (2024-05-22T16:15:17Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with new websites.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z)
- Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path [28.898240725099782]
We propose a new approach, ReXMiner, for zero-shot relation extraction in web mining.
ReXMiner encodes the shortest relative paths in the Document Object Model (DOM) tree.
It also incorporates the popularity of each text node by counting the occurrence of the same text node across different web pages.
arXiv Detail & Related papers (2023-05-23T08:16:52Z)
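The ReXMiner entry above rests on a simple structural idea: the relative path between two text nodes runs up from one node to their lowest common ancestor and back down to the other. A minimal sketch of that computation, with an invented example document (the `"^"` ancestor marker and tag-path encoding are illustrative assumptions, not ReXMiner's actual encoding):

```python
import xml.etree.ElementTree as ET

# Illustrative sketch: compute the shortest relative path between two nodes
# in a DOM-like tree by walking up to their lowest common ancestor (LCA)
# and down to the target. "^" marks the LCA in the resulting path.

def paths_from_root(root):
    """Map each element to its chain of ancestor elements from the root."""
    paths = {root: [root]}
    for parent in root.iter():       # preorder: parents visited first
        for child in parent:
            paths[child] = paths[parent] + [child]
    return paths

def relative_path(paths, a, b):
    pa, pb = paths[a], paths[b]
    # Length of the shared prefix = depth of the lowest common ancestor.
    i = 0
    while i < min(len(pa), len(pb)) and pa[i] is pb[i]:
        i += 1
    up = [e.tag for e in reversed(pa[i:])]   # tags climbed from a to the LCA
    down = [e.tag for e in pb[i:]]           # tags descended to reach b
    return up + ["^"] + down

doc = ET.fromstring(
    "<html><body><table><tr><td>key</td><td>value</td></tr></table></body></html>"
)
paths = paths_from_root(doc)
tds = doc.findall(".//td")
print(relative_path(paths, tds[0], tds[1]))  # ['td', '^', 'td']
```

Sibling table cells thus share a very short relative path regardless of how deeply the table is nested, which is what makes such paths a compact relation signal across differently structured pages.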
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.