Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM
- URL: http://arxiv.org/abs/2511.23119v1
- Date: Fri, 28 Nov 2025 12:04:46 GMT
- Title: Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM
- Authors: Mengjie Liu, Jiahui Peng, Pei Chu, Jiantao Qiu, Ren Ma, He Zhu, Rui Min, Lindong Lu, Wenchang Ning, Linfeng Hou, Kaiwen Liu, Yuan Qu, Zhenxiang Li, Chao Xu, Zhongying Tu, Wentao Zhang, Conghui He
- Abstract summary: We present Dripper, an efficient HTML main content extraction framework powered by lightweight language models. We introduce a controlled decoding mechanism that strictly constrains the output space through logits processors. Dripper achieves state-of-the-art performance across all evaluation benchmarks and outperforms all baseline methods.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurately and efficiently extracting main content from general web pages is of great significance for obtaining training data for large models. Well-pre-trained decoder-only generative language models offer excellent document comprehension capabilities, thereby effectively enhancing parsing quality; however, they remain constrained by issues such as context window length, inference cost, and format hallucination. We present Dripper, an efficient HTML main content extraction framework powered by lightweight language models, which addresses these challenges through four key innovations: (1) We design a specialized HTML simplification algorithm that reduces the input token count to 22% of the raw HTML while preserving critical structural information; (2) We reformulate main content extraction as a semantic block sequence classification task, significantly reducing inference cost; (3) We introduce a controlled decoding mechanism that strictly constrains the output space through logits processors, effectively eliminating the hallucination issues common in small-scale models; (4) We propose WebMainBench, an evaluation dataset containing over 7,800 web pages with meticulously human-annotated main content extraction labels. Experimental results demonstrate that, using only a 0.6B parameter model, Dripper achieves state-of-the-art performance across all evaluation benchmarks and outperforms all baseline methods, attaining a ROUGE-N F1 score of 81.58% (83.13% with a fall-back strategy) on our proposed WebMainBench dataset.
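The controlled-decoding idea in (3) can be illustrated with a toy sketch: a logits processor masks every vocabulary position except a small set of legal label tokens, so even a small model can never emit an out-of-format token. The vocabulary, token ids, and label names below are hypothetical, not Dripper's actual implementation:

```python
import math

def constrain_logits(logits, allowed_ids):
    """Mask logits so only token ids in allowed_ids can be decoded.

    logits: one float per vocabulary entry.
    allowed_ids: ids of valid label tokens (e.g. per-block
    "main" / "noise" decisions in a block-classification setup).
    """
    return [x if i in allowed_ids else -math.inf
            for i, x in enumerate(logits)]

def greedy_pick(logits):
    """One greedy decoding step: argmax over the (masked) logits."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy 5-token vocabulary; only ids 2 ("main") and 3 ("noise") are legal labels.
raw = [1.5, 3.0, 0.7, 0.2, 2.9]
masked = constrain_logits(raw, {2, 3})
print(greedy_pick(raw))     # unconstrained argmax: 1 (an illegal token)
print(greedy_pick(masked))  # constrained argmax: 2 -- always a valid label
```

In practice such a mask would be applied at every decoding step (for example via a `LogitsProcessor` in the Hugging Face `transformers` generation API), which is what makes hallucinated output formats structurally impossible rather than merely unlikely.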
Related papers
- Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining [78.36592534300839]
We show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance. This suggests a simple intervention: by taking a union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71%.
arXiv Detail & Related papers (2026-02-23T06:41:57Z) - AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser [54.623900859999424]
We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation.
arXiv Detail & Related papers (2025-11-20T14:15:23Z) - SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning [48.376164461507244]
We introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o.
arXiv Detail & Related papers (2025-10-02T09:27:15Z) - NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction [6.09502686736443]
We introduce a concrete evaluation framework for web data record extraction. Our framework generates evaluation snapshots, annotates supervision labels, and employs structure-aware metrics for consistent scoring. It also incorporates preprocessing to optimize input for Large Language Model (LLM)-based approaches.
arXiv Detail & Related papers (2025-05-21T21:03:37Z) - ReaderLM-v2: Small Language Model for HTML to Markdown and JSON [7.9969849952515775]
We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents of up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models.
arXiv Detail & Related papers (2025-03-03T03:57:04Z) - Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation [70.87670058323239]
Document-level Neural Machine Translation (DocNMT) has been proven crucial for handling discourse phenomena by introducing document-level context information.
One of the most important directions is to input the whole document directly to the standard Transformer model.
In this work, we maintain translation performance while gaining a 20% speed-up by introducing an extra selection layer, based on lightweight attention, that selects a small portion of tokens to be attended to.
arXiv Detail & Related papers (2023-09-25T14:33:47Z) - Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.