ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
- URL: http://arxiv.org/abs/2602.15189v1
- Date: Mon, 16 Feb 2026 20:56:59 GMT
- Title: ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction
- Authors: William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan
- Abstract summary: We introduce ScrapeGraphAI-100k, a large-scale dataset of real-world LLM extraction events. Starting from 9M events, we deduplicate and balance by schema to produce 93,695 examples spanning diverse domains. We characterize the dataset's structural diversity and its failure modes as schema complexity increases.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of large language models for web information extraction is becoming increasingly fundamental to modern web information retrieval pipelines. However, existing datasets tend to be small, synthetic, or text-only, failing to capture the structural context of the web. We introduce ScrapeGraphAI-100k, a large-scale dataset comprising real-world LLM extraction events, collected via opt-in ScrapeGraphAI telemetry during Q2 and Q3 of 2025. Starting from 9M events, we deduplicate and balance by schema to produce 93,695 examples spanning diverse domains and languages. Each instance includes Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata. We characterize the dataset's structural diversity and its failure modes as schema complexity increases. We also provide a fine-tuning experiment showing that a small language model (1.7B) trained on a subset narrows the gap to larger baselines (30B), underscoring the dataset's utility for efficient extraction. ScrapeGraphAI-100k enables fine-tuning small models, benchmarking structured extraction, and studying schema induction for web IR indexing, and is publicly available on HuggingFace.
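As a rough illustration of how such records might be consumed, the sketch below loads the dataset and checks an LLM response against its JSON schema. The HuggingFace repo id and field names are assumptions; the paper only states that each instance carries Markdown content, a prompt, a JSON schema, the LLM response, and complexity/validation metadata.

```python
# A rough sketch of consuming ScrapeGraphAI-100k records. The repo id and
# field names are assumptions, not taken from the paper.
import json

from datasets import load_dataset
from jsonschema import ValidationError, validate

# Hypothetical repo id -- check HuggingFace for the actual dataset card.
ds = load_dataset("scrapegraphai/ScrapeGraphAI-100k", split="train")

example = ds[0]
schema = json.loads(example["schema"])      # assumed field name
response = json.loads(example["response"])  # assumed field name

try:
    # Verify the LLM output conforms to the requested JSON schema,
    # mirroring the dataset's validation metadata.
    validate(instance=response, schema=schema)
    print("response conforms to schema")
except ValidationError as err:
    print(f"schema violation: {err.message}")
```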
Related papers
- Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining [78.36592534300839]
We show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance. This suggests a simple intervention: by taking a union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71%.
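A minimal sketch of what a union over HTML-to-text extractors could look like in practice, assuming two common libraries (trafilatura and BeautifulSoup) and a simple line-level dedup; the paper's actual pipeline is not reproduced here.

```python
# Keep every line of text that any extractor recovers, deduplicated.
# The choice of libraries and the dedup rule are assumptions.
import trafilatura
from bs4 import BeautifulSoup


def extract_union(html: str) -> str:
    candidates = []

    # Extractor 1: trafilatura's main-content extraction.
    main_text = trafilatura.extract(html)
    if main_text:
        candidates.extend(main_text.splitlines())

    # Extractor 2: a plain text dump, which keeps structured content
    # (tables, code blocks) that main-content extractors often drop.
    soup = BeautifulSoup(html, "html.parser")
    candidates.extend(soup.get_text("\n").splitlines())

    # Union with order preserved: first occurrence of each normalized line.
    seen, merged = set(), []
    for line in candidates:
        key = " ".join(line.split()).lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(line.strip())
    return "\n".join(merged)
```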
arXiv Detail & Related papers (2026-02-23T06:41:57Z)
- Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
The Arranged and Organized Extraction (AOE) benchmark is designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schemas tailored to varied input queries. Results show that even the most advanced models struggled significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
- MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [48.73595915402094]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic. Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Distantly Supervised Morpho-Syntactic Model for Relation Extraction [0.27195102129094995]
We present a method for the extraction and categorisation of an unrestricted set of relationships from text.
We evaluate our approach on six datasets built on Wikidata and Wikipedia.
arXiv Detail & Related papers (2024-01-18T14:17:40Z)
- Leveraging Large Language Models for Node Generation in Few-Shot Learning on Text-Attributed Graphs [5.587264586806575]
We propose a plug-and-play approach to empower text-attributed graphs through node generation using Large Language Models (LLMs). LLMs extract semantic information from labels and generate samples that belong to those categories as exemplars. We employ an edge predictor to capture structural information inherent in the raw dataset and integrate the newly generated samples into the original graph.
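The sketch below illustrates the integration step under toy assumptions: a cosine-similarity threshold stands in for the paper's learned edge predictor, and the embeddings and labels are placeholders.

```python
# Add LLM-generated samples as new nodes and connect them to similar
# existing nodes. The similarity threshold is a stand-in for the paper's
# trained edge predictor.
import networkx as nx
import numpy as np


def integrate_generated_nodes(graph, node_embs, generated, threshold=0.8):
    """graph: nx.Graph; node_embs: {node_id: np.ndarray};
    generated: [(node_id, embedding, label)] produced by the LLM."""
    for node_id, emb, label in generated:
        graph.add_node(node_id, label=label)
        for other, other_emb in list(node_embs.items()):
            sim = float(np.dot(emb, other_emb) /
                        (np.linalg.norm(emb) * np.linalg.norm(other_emb)))
            if sim >= threshold:  # stand-in for the edge predictor
                graph.add_edge(node_id, other)
        node_embs[node_id] = emb
    return graph
```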
arXiv Detail & Related papers (2023-10-15T16:04:28Z)
- DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models [61.906934570771256]
We present a generic dataset generation model that can produce diverse synthetic images and perception annotations.
Our method builds upon the pre-trained diffusion model and extends text-guided image synthesis to perception data generation.
We show that the rich latent code of the diffusion model can be effectively decoded as accurate perception annotations using a decoder module.
arXiv Detail & Related papers (2023-08-11T14:38:11Z)
- PLAtE: A Large-scale Dataset for List Page Web Extraction [19.92099953576541]
PLAtE is composed of 52,898 items collected from 6,694 pages and 156,014 attributes, making it the first large-scale list page web extraction dataset.
We use a multi-stage approach to collect and annotate the dataset, and adapt three state-of-the-art web extraction models to the two tasks, comparing their strengths and weaknesses both quantitatively and qualitatively.
arXiv Detail & Related papers (2022-05-24T22:26:58Z)
- The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models [51.39011092347136]
We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety.
First, we empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task.
Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page.
Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
arXiv Detail & Related papers (2021-11-03T12:13:52Z)
- Data Augmentation for Abstractive Query-Focused Multi-Document Summarization [129.96147867496205]
We present two QMDS training datasets, which we construct using two data augmentation methods.
These two datasets have complementary properties, i.e., QMDSCNN has real summaries but queries are simulated, while QMDSIR has real queries but simulated summaries.
We build end-to-end neural network models on the combined datasets that yield new state-of-the-art transfer results on DUC datasets.
arXiv Detail & Related papers (2021-03-02T16:57:01Z)
- WebRED: Effective Pretraining And Finetuning For Relation Extraction On The Web [4.702325864333419]
WebRED is a strongly supervised, human-annotated dataset for extracting relationships from text found on the World Wide Web.
We show that combining pre-training on a large weakly supervised dataset with fine-tuning on a small strongly-supervised dataset leads to better relation extraction performance.
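A toy illustration of that two-stage recipe on synthetic data: only the pattern (noisy pre-training followed by clean fine-tuning of the same model) reflects the paper; the model, data, and hyperparameters are placeholders.

```python
# Weak->strong training: the same model first trains on many noisy
# labels, then fine-tunes on a few clean ones at a lower learning rate.
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()


def train(x, y, lr, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()


# Stage 1: large, weakly supervised set (noisy labels), short schedule.
weak_x, weak_y = torch.randn(5000, 16), torch.randint(0, 4, (5000,))
train(weak_x, weak_y, lr=1e-3, epochs=5)

# Stage 2: small, strongly supervised set (clean labels), lower LR.
strong_x, strong_y = torch.randn(200, 16), torch.randint(0, 4, (200,))
train(strong_x, strong_y, lr=1e-4, epochs=20)
```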
arXiv Detail & Related papers (2021-02-18T23:56:12Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)