AXE: Low-Cost Cross-Domain Web Structured Information Extraction
- URL: http://arxiv.org/abs/2602.01838v1
- Date: Mon, 02 Feb 2026 09:09:35 GMT
- Title: AXE: Low-Cost Cross-Domain Web Structured Information Extraction
- Authors: Abdelrahman Mansour, Khaled W. Alshaer, Moataz Elsaban,
- Abstract summary: AXE is a pipeline that treats the HTML DOM as a tree that needs pruning rather than just a wall of text to be read. AXE uses a specialized "pruning" mechanism to strip away boilerplate and irrelevant nodes. We aim to provide a practical, cost-effective path for large-scale web information extraction.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extracting structured data from the web is often a trade-off between the brittle nature of manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X-Path Extractor), a pipeline that rethinks this process by treating the HTML DOM as a tree that needs pruning rather than just a wall of text to be read. AXE uses a specialized "pruning" mechanism to strip away boilerplate and irrelevant nodes, leaving behind a distilled, high-density context that allows a tiny 0.6B LLM to generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), ensuring every extraction is physically traceable to a source node. Despite its low footprint, AXE achieves state-of-the-art zero-shot performance, outperforming several much larger, fully-trained alternatives with an F1 score of 88.1% on the SWDE dataset. By releasing our specialized adaptors, we aim to provide a practical, cost-effective path for large-scale web information extraction.
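The pruning-plus-grounding pipeline the abstract describes can be sketched in a few lines. This is a minimal illustration only, assuming a well-formed DOM: the boilerplate tag list, function names, and positional-XPath scheme are hypothetical, not AXE's published implementation.

```python
# Hypothetical sketch of DOM pruning plus grounded XPath indexing.
import xml.etree.ElementTree as ET

# Illustrative boilerplate set; AXE's actual pruning criteria are learned/specialized.
BOILERPLATE_TAGS = {"script", "style", "nav", "footer", "aside"}

def prune(elem):
    """Recursively drop boilerplate subtrees, keeping content nodes."""
    for child in list(elem):
        if child.tag in BOILERPLATE_TAGS:
            elem.remove(child)
        else:
            prune(child)
    return elem

def xpath_index(root):
    """Map each text-bearing node to a positional XPath, so every extracted
    value is traceable to a source node (the Grounded XPath Resolution idea)."""
    index = {}
    def walk(elem, path):
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
            if child.text and child.text.strip():
                index[child_path] = child.text.strip()
            walk(child, child_path)
    walk(root, f"/{root.tag}")
    return index

html = """<html><body>
  <nav><a>Home</a></nav>
  <div><h1>AXE Paper</h1><p>F1: 88.1</p></div>
  <footer><p>copyright</p></footer>
</body></html>"""

root = prune(ET.fromstring(html))
idx = xpath_index(root)
# Only the content nodes survive, each keyed by a verifiable XPath.
```

In a full pipeline, the distilled `idx` (or the pruned serialized DOM) would be the context handed to the small LLM, and any value the model emits would be accepted only if it matches a node in the index.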
Related papers
- Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining [78.36592534300839]
We show that for structured content such as tables and code blocks, extractor choice can significantly impact downstream task performance. This suggests a simple intervention: by taking a Union over different extractors, we can increase the token yield of DCLM-Baseline by up to 71%.
arXiv Detail & Related papers (2026-02-23T06:41:57Z)
- ScrapeGraphAI-100k: A Large-Scale Dataset for LLM-Based Web Information Extraction [0.0]
We introduce ScrapeGraphAI-100k, a large-scale dataset of real-world LLM extraction events. Starting from 9M events, we deduplicate and balance by schema to produce 93,695 examples spanning diverse domains. We characterize the dataset's structural diversity and its failure modes as schema complexity increases.
arXiv Detail & Related papers (2026-02-16T20:56:59Z)
- AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser [54.623900859999424]
We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation.
arXiv Detail & Related papers (2025-11-20T14:15:23Z)
- Spectra-to-Structure and Structure-to-Spectra Inference Across the Periodic Table [49.65586812435899]
XAStruct is a learning-based system capable of both predicting XAS spectra from crystal structures and inferring local structural descriptors from XAS input. XAStruct is trained on a large-scale dataset spanning over 70 elements across the periodic table.
arXiv Detail & Related papers (2025-06-13T15:58:05Z)
- REXEL: An End-to-end Model for Document-Level Relation Extraction and Entity Linking [11.374031643273941]
REXEL is a highly efficient and accurate model for the joint task of document-level cIE (DocIE).
It is on average 11 times faster than competitive existing approaches in a similar setting.
The combination of speed and accuracy makes REXEL an accurate cost-efficient system for extracting structured information at web-scale.
arXiv Detail & Related papers (2024-04-19T11:04:27Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- Combining Language and Graph Models for Semi-structured Information Extraction on the Web [7.44454462555094]
We present GraphScholarBERT, an open-domain information extraction method based on a joint graph and language model structure.
Experiments show that GraphScholarBERT can improve extraction F1 scores by as much as 34.8% compared to previous work in a zero-shot domain and zero-shot website setting.
arXiv Detail & Related papers (2024-02-21T20:53:29Z)
- Instruct and Extract: Instruction Tuning for On-Demand Information Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users.
We present a benchmark named InstructIE, inclusive of both automatically generated training data, as well as the human-annotated test set.
Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z)
- Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path [28.898240725099782]
We propose a new approach, ReXMiner, for zero-shot relation extraction in web mining.
ReXMiner encodes the shortest relative paths in the Document Object Model (DOM) tree.
It also incorporates the popularity of each text node by counting the occurrence of the same text node across different web pages.
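The relative-path idea behind ReXMiner can be illustrated with a small sketch: the path between two DOM nodes runs up to their lowest common ancestor, then down. This is a hypothetical illustration; the paper's exact path encoding may differ.

```python
# Sketch of a shortest relative DOM path between two nodes, where each
# node is given as its root-to-node list of tag names.
def relative_path(path_a, path_b):
    """Return the walk from node A up to the lowest common ancestor,
    then down to node B ('..' marks one step up)."""
    i = 0
    while i < min(len(path_a), len(path_b)) and path_a[i] == path_b[i]:
        i += 1
    return [".."] * (len(path_a) - i) + path_b[i:]

# E.g. a key cell and its value cell in the same table row (illustrative):
key = ["html", "body", "table", "tr", "td"]
val = ["html", "body", "table", "tr", "th"]
rel = relative_path(key, val)  # one step up to <tr>, then down to <th>
```

Because this relative path is the same wherever the row appears in the page, it generalizes across pages that share a template, which is what makes it useful for zero-shot extraction.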
arXiv Detail & Related papers (2023-05-23T08:16:52Z)
- Feature Extractor Stacking for Cross-domain Few-shot Learning [7.624311495433939]
Cross-domain few-shot learning addresses learning problems where knowledge needs to be transferred from one or more source domains into an instance-scarce target domain with an explicitly different distribution.
We propose feature extractor stacking (FES), a new CDFSL method for combining information from a collection of extractors out of the box.
We present the basic FES algorithm, which is inspired by the classic stacked generalisation approach, and also introduce two variants: convolutional FES (ConFES) and regularised FES (ReFES).
arXiv Detail & Related papers (2022-05-12T01:54:22Z)
- Node Feature Extraction by Self-Supervised Multi-scale Neighborhood Prediction [123.20238648121445]
We propose a new self-supervised learning framework, Graph Information Aided Node feature exTraction (GIANT).
GIANT makes use of the eXtreme Multi-label Classification (XMC) formalism, which is crucial for fine-tuning the language model based on graph information.
We demonstrate the superior performance of GIANT over the standard GNN pipeline on Open Graph Benchmark datasets.
arXiv Detail & Related papers (2021-10-29T19:55:12Z)
- Simplified DOM Trees for Transferable Attribute Extraction from the Web [15.728164692696689]
Given a web page, extracting a structured object along with various attributes of interest can facilitate a variety of downstream applications.
Existing approaches formulate the problem as a DOM tree node tagging task.
We propose a novel transferable method, SimpDOM, to tackle the problem by efficiently retrieving useful context for each node.
arXiv Detail & Related papers (2021-01-07T07:41:55Z)
- ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages [66.45377533562417]
We propose a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template.
Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage.
arXiv Detail & Related papers (2020-05-14T16:15:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.