Related papers: Simplified DOM Trees for Transferable Attribute Extraction from the Web

URL: http://arxiv.org/abs/2101.02415v1
Date: Thu, 7 Jan 2021 07:41:55 GMT
Title: Simplified DOM Trees for Transferable Attribute Extraction from the Web
Authors: Yichao Zhou, Ying Sheng, Nguyen Vo, Nick Edmonds, Sandeep Tata
Abstract summary: Given a web page, extracting a structured object along with various attributes of interest can facilitate a variety of downstream applications. Existing approaches formulate the problem as a DOM tree node tagging task. We propose a novel transferable method, SimpDOM, to tackle the problem by efficiently retrieving useful context for each node.
Score: 15.728164692696689
License: http://creativecommons.org/licenses/by/4.0/
Abstract: There has been a steady need to precisely extract structured knowledge from the web (i.e. HTML documents). Given a web page, extracting a structured object along with various attributes of interest (e.g. price, publisher, author, and genre for a book) can facilitate a variety of downstream applications such as large-scale knowledge base construction, e-commerce product search, and personalized recommendation. Considering each web page is rendered from an HTML DOM tree, existing approaches formulate the problem as a DOM tree node tagging task. However, they either rely on computationally expensive visual feature engineering or are incapable of modeling the relationship among the tree nodes. In this paper, we propose a novel transferable method, Simplified DOM Trees for Attribute Extraction (SimpDOM), to tackle the problem by efficiently retrieving useful context for each node by leveraging the tree structure. We study two challenging experimental settings: (i) intra-vertical few-shot extraction, and (ii) cross-vertical few-shot extraction with out-of-domain knowledge, to evaluate our approach. Extensive experiments on the SWDE public dataset show that SimpDOM outperforms the state-of-the-art (SOTA) method by 1.44% on the F1 score. We also find that utilizing knowledge from a different vertical (cross-vertical extraction) is surprisingly useful and helps beat the SOTA by a further 1.37%.
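
As a rough illustration of what "retrieving useful context for each node by leveraging the tree structure" can look like, the Python sketch below parses a page into a minimal tree and, for each text node, gathers the text of nearby nodes under a shared ancestor. This is not the SimpDOM implementation: the Node class, the helper names (build_tree, text_nodes, tag_path, context_for), the max_up parameter, and the sample page are all hypothetical, and the parser assumes well-formed, properly nested HTML without unclosed void tags.

```python
from html.parser import HTMLParser


class Node:
    """Hypothetical minimal DOM node: a tag, a parent, children, and own text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every opened tag is properly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_nodes(node):
    """Yield nodes that carry text, in document order."""
    if node.text:
        yield node
    for child in node.children:
        yield from text_nodes(child)


def tag_path(node):
    """Return the tag path from the top-level element down to this node."""
    parts = []
    while node.parent is not None:
        parts.append(node.tag)
        node = node.parent
    return "/".join(reversed(parts))


def context_for(node, max_up=2):
    """Text of the other text nodes under the ancestor max_up levels above."""
    ancestor = node
    for _ in range(max_up):
        if ancestor.parent is None:
            break
        ancestor = ancestor.parent
    return [n.text for n in text_nodes(ancestor) if n is not node]


# Made-up sample page with two label/value pairs.
PAGE = (
    "<html><body>"
    "<div><span>Author:</span><b>Jane Doe</b></div>"
    "<div><span>Price:</span><b>$12</b></div>"
    "</body></html>"
)

root = build_tree(PAGE)
# Per-node features a tagger could consume: tag path plus nearby text.
features = {
    n.text: (tag_path(n), context_for(n, max_up=1)) for n in text_nodes(root)
}
```

With max_up=1, the value "Jane Doe" is paired only with its neighboring label "Author:"; raising max_up widens the context to text from the sibling block as well, which trades precision of the context for coverage.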

Related papers

ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval [64.44265315244579]
We propose a tree-based method for organizing and representing reference documents at various granular levels. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches. Our evaluations show that ReTreever generally preserves full representation accuracy.
arXiv Detail & Related papers (2025-02-11T21:35:13Z)
Instruct and Extract: Instruction Tuning for On-Demand Information Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users. We present a benchmark named InstructIE, which includes both automatically generated training data and a human-annotated test set. Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z)
Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path [28.898240725099782]
We propose a new approach, ReXMiner, for zero-shot relation extraction in web mining. ReXMiner encodes the shortest relative paths in the Document Object Model (DOM) tree. It also incorporates the popularity of each text node by counting the occurrence of the same text node across different web pages.
arXiv Detail & Related papers (2023-05-23T08:16:52Z)
Searching a High-Performance Feature Extractor for Text Recognition Network [92.12492627169108]
We design a domain-specific search space by exploring principles for having good feature extractors. As the space is huge and complexly structured, no existing NAS algorithms can be applied. We propose a two-stage algorithm to effectively search in the space.
arXiv Detail & Related papers (2022-09-27T03:49:04Z)
Modeling Multi-Granularity Hierarchical Features for Relation Extraction [26.852869800344813]
We propose a novel method to extract multi-granularity features based solely on the original input sentences. We show that effective structured features can be attained even without external knowledge.
arXiv Detail & Related papers (2022-04-09T09:44:05Z)
WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z)
CoVA: Context-aware Visual Attention for Webpage Information Extraction [65.11609398029783]
We propose to reformulate WIE as a context-aware Webpage Object Detection task. We develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree. We show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.
arXiv Detail & Related papers (2021-10-24T00:21:46Z)
One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios. Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents. We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents [16.101638575566444]
FreeDOM is a two-stage neural architecture. The first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer range distance and semantic relatedness using a relational neural network.
arXiv Detail & Related papers (2020-10-21T04:20:13Z)
ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages [66.45377533562417]
We propose a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template. Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage.
arXiv Detail & Related papers (2020-05-14T16:15:58Z)
SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level. We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.