Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining
- URL: http://arxiv.org/abs/2602.19548v1
- Date: Mon, 23 Feb 2026 06:41:57 GMT
- Title: Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining
- Authors: Jeffrey Li, Josh Gardner, Doug Kang, Fangping Shi, Karanjeet Singh, Chun-Liang Li, Herumb Shandilya, David Hall, Oncel Tuzel, Percy Liang, Ludwig Schmidt, Hadi Pour Ansari, Fartash Faghri
- Abstract summary: We show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71%.
- Score: 78.36592534300839
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the first pre-processing steps for constructing web-scale LLM pretraining datasets involves extracting text from HTML. Despite the immense diversity of web content, existing open-source datasets predominantly apply a single fixed extractor to all webpages. In this work, we investigate whether this practice leads to suboptimal coverage and utilization of Internet data. We first show that while different extractors may lead to similar model performance on standard language understanding tasks, the pages surviving a fixed filtering pipeline can differ substantially. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71% while maintaining benchmark performance. We further show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance, with differences of up to 10 percentage points (p.p.) on WikiTQ and 3 p.p. on HumanEval.
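The union intervention described in the abstract can be sketched as follows: run several extractors over each page, apply the same quality filter to every output, and keep a page if any extractor's text survives. This is a minimal illustration, not the paper's pipeline; the two extractors and the word-count filter below are toy stand-ins (the actual work uses production extractors and the DCLM filtering stack).

```python
import re
from typing import Callable

# Hypothetical extractor 1: strip all tags from the raw HTML.
def strip_tags(html: str) -> str:
    return re.sub(r"<[^>]+>", " ", html)

# Hypothetical extractor 2: keep only text inside <body>, else fail with "".
def body_only(html: str) -> str:
    m = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    return re.sub(r"<[^>]+>", " ", m.group(1)) if m else ""

def passes_filter(text: str, min_words: int = 3) -> bool:
    # Toy quality filter: require a minimum word count.
    return len(text.split()) >= min_words

def union_extract(pages: dict[str, str],
                  extractors: list[Callable[[str], str]]) -> dict[str, str]:
    """Keep a page if ANY extractor yields text passing the filter,
    preferring the longest surviving extraction for that page."""
    kept: dict[str, str] = {}
    for url, html in pages.items():
        candidates = [t for t in (ex(html) for ex in extractors)
                      if passes_filter(t)]
        if candidates:
            kept[url] = max(candidates, key=len)
    return kept
```

Pages that only one extractor handles well (here, a page without a `<body>` tag) are rescued by the union rather than discarded, which is the mechanism behind the reported token-yield increase.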
Related papers
- An Index-based Approach for Efficient and Effective Web Content Extraction [38.40209116782093]
We introduce Index-based Web Content Extraction. We partition HTML into structure-aware, addressable segments and extract only the positional indices of content relevant to a given query. This method decouples extraction latency from content length, enabling rapid, query-relevant extraction.
arXiv Detail & Related papers (2025-12-07T03:18:19Z) - Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM [35.10225876152952]
We present Dripper, an efficient HTML main content extraction framework powered by lightweight language models. We introduce a controlled decoding mechanism that strictly constrains the output space through logits processors. Dripper achieves state-of-the-art performance across all evaluation benchmarks and outperforms all baseline methods.
arXiv Detail & Related papers (2025-11-28T12:04:46Z) - AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser [54.623900859999424]
We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation.
arXiv Detail & Related papers (2025-11-20T14:15:23Z) - SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning [48.376164461507244]
We introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o.
arXiv Detail & Related papers (2025-10-02T09:27:15Z) - Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
We introduce the Arranged and Organized Extraction (AOE) benchmark, designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. Results show that even the most advanced models struggle significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z) - Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models [92.85086256871027]
We propose REWIRE (REcycling the Web with guIded REwrite) to enrich low-quality documents so that they can become useful for training. We demonstrate that about 82% of the mixed-in texts come from transforming lower-quality documents that would otherwise be discarded.
arXiv Detail & Related papers (2025-06-05T07:12:12Z) - NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction [6.09502686736443]
We introduce a concrete evaluation framework for web data record extraction. Our framework generates evaluation snapshots, annotates supervision labels, and employs structure-aware metrics for consistent scoring. It also incorporates preprocessing to optimize input for Large Language Model (LLM)-based approaches.
arXiv Detail & Related papers (2025-05-21T21:03:37Z) - AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z) - Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.