SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning
- URL: http://arxiv.org/abs/2510.01832v1
- Date: Thu, 02 Oct 2025 09:27:15 GMT
- Title: SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning
- Authors: Shicheng Liu, Kai Sun, Lisheng Fu, Xilun Chen, Xinyuan Zhang, Zhaojiang Lin, Rulin Shao, Yue Liu, Anuj Kumar, Wen-tau Yih, Xin Luna Dong,
- Abstract summary: We introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework.<n>Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages.<n> Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o.
- Score: 48.376164461507244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet the formatting complicates usage, and reliably extracting structured information from them remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach further improves by iteratively training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.
Related papers
- Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining [78.36592534300839]
We show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance.<n>This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71%.
arXiv Detail & Related papers (2026-02-23T06:41:57Z) - Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM [35.10225876152952]
We present Dripper, an efficient HTML main content extraction framework powered by lightweight language models.<n>We introduce a controlled decoding mechanism that strictly constrains the output space through logits processors.<n>Dripper achieves state-of-the-art performance across all evaluation benchmarks and outperforms all baseline methods.
arXiv Detail & Related papers (2025-11-28T12:04:46Z) - Decoding Latent Attack Surfaces in LLMs: Prompt Injection via HTML in Web Summarization [1.3537117504260623]
Large Language Models (LLMs) are increasingly integrated into web-based systems for content summarization.<n>This study explores how non-visible HTML elements can be exploited to embed adversarial instructions without altering the visible content of a webpage.
arXiv Detail & Related papers (2025-09-06T21:05:18Z) - Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction [83.0216122783429]
Web Reconstruction (WebR) is a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents.<n>We show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks.
arXiv Detail & Related papers (2025-04-22T04:07:13Z) - HTML-LSTM: Information Extraction from HTML Tables in Web Pages using Tree-Structured LSTM [1.104960878651584]
We aim to integrate multiple HTML tables into a single table for retrieval of information containing in various Web pages.
The method is designed by extending tree-structured LSTM, the neural network for tree-structured data, in order to extract information that is both linguistic and structural information of HTML data.
arXiv Detail & Related papers (2024-09-28T19:58:29Z) - AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing methods, wrappers-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z) - WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z) - A Graph Representation of Semi-structured Data for Web Question
Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z) - ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured
Webpages [66.45377533562417]
We propose a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template.
Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage.
arXiv Detail & Related papers (2020-05-14T16:15:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.