Simplified DOM Trees for Transferable Attribute Extraction from the Web
- URL: http://arxiv.org/abs/2101.02415v1
- Date: Thu, 7 Jan 2021 07:41:55 GMT
- Title: Simplified DOM Trees for Transferable Attribute Extraction from the Web
- Authors: Yichao Zhou, Ying Sheng, Nguyen Vo, Nick Edmonds, Sandeep Tata
- Abstract summary: Given a web page, extracting a structured object along with various attributes of interest can facilitate a variety of downstream applications.
Existing approaches formulate the problem as a DOM tree node tagging task.
We propose a novel transferable method, SimpDOM, to tackle the problem by efficiently retrieving useful context for each node.
- Score: 15.728164692696689
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There has been a steady need to precisely extract structured knowledge from
the web (i.e. HTML documents). Given a web page, extracting a structured object
along with various attributes of interest (e.g. price, publisher, author, and
genre for a book) can facilitate a variety of downstream applications such as
large-scale knowledge base construction, e-commerce product search, and
personalized recommendation. Considering each web page is rendered from an HTML
DOM tree, existing approaches formulate the problem as a DOM tree node tagging
task. However, they either rely on computationally expensive visual feature
engineering or are incapable of modeling the relationship among the tree nodes.
In this paper, we propose a novel transferable method, Simplified DOM Trees for
Attribute Extraction (SimpDOM), which tackles the problem by leveraging the tree
structure to efficiently retrieve useful context for each node. We study two
challenging experimental settings: (i) intra-vertical few-shot extraction, and
(ii) cross-vertical few-shot extraction with out-of-domain knowledge, to
evaluate our approach. Extensive experiments on the SWDE public dataset show
that SimpDOM outperforms the state-of-the-art (SOTA) method by 1.44% on the F1
score. We also find that utilizing knowledge from a different vertical
(cross-vertical extraction) is surprisingly useful and helps beat the SOTA by a
further 1.37%.
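
The idea of tagging DOM text nodes with context drawn from the tree structure can be illustrated with a minimal sketch. This is not the authors' SimpDOM implementation: the `Node` and `TreeBuilder` classes, the `context_for` helper, the `max_levels` parameter, and the sample HTML are all hypothetical names and data chosen for illustration. The sketch only assumes that the context of a text node is the text of other nodes under a nearby common ancestor.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for the sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A hypothetical, minimal DOM node: elements and '#text' leaves."""

    def __init__(self, tag, parent=None, text=""):
        self.tag = tag
        self.parent = parent
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a small tree of Node objects from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            leaf = Node("#text", parent=self.current, text=text)
            self.current.children.append(leaf)


def text_nodes(node):
    """Yield all text leaves under `node` in document order."""
    if node.tag == "#text":
        yield node
    for child in node.children:
        yield from text_nodes(child)


def context_for(node, max_levels=2):
    """Return the text of other leaves under the ancestor `max_levels` up.

    Assumption for this sketch: nearby nodes in the tree (for example a
    label such as "Author:") are useful context for tagging a value node.
    """
    ancestor = node
    for _ in range(max_levels):
        if ancestor.parent is None:
            break
        ancestor = ancestor.parent
    return [n.text for n in text_nodes(ancestor) if n is not node]


SAMPLE = (
    "<div>"
    "<div><span>Author:</span><span>Jane Doe</span></div>"
    "<div><span>Price:</span><span>$12.99</span></div>"
    "</div>"
)

builder = TreeBuilder()
builder.feed(SAMPLE)
nodes = list(text_nodes(builder.root))
for n in nodes:
    print(n.text, "->", context_for(n))
```

Raising `max_levels` widens the context window to more distant relatives in the tree, at the cost of pulling in less relevant text. A real tagger would feed the node text together with this context into a classifier rather than printing it.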
Related papers
- Instruct and Extract: Instruction Tuning for On-Demand Information Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users.
We present a benchmark named InstructIE, inclusive of both automatically generated training data, as well as the human-annotated test set.
Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z)
- Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path [28.898240725099782]
We propose a new approach, ReXMiner, for zero-shot relation extraction in web mining.
ReXMiner encodes the shortest relative paths in the Document Object Model (DOM) tree.
It also incorporates the popularity of each text node by counting the occurrence of the same text node across different web pages.
arXiv Detail & Related papers (2023-05-23T08:16:52Z)
- Searching a High-Performance Feature Extractor for Text Recognition Network [92.12492627169108]
We design a domain-specific search space by exploring principles for having good feature extractors.
As the space is huge and complexly structured, no existing NAS algorithms can be applied.
We propose a two-stage algorithm to effectively search in the space.
arXiv Detail & Related papers (2022-09-27T03:49:04Z)
- Modeling Multi-Granularity Hierarchical Features for Relation Extraction [26.852869800344813]
We propose a novel method to extract multi-granularity features based solely on the original input sentences.
We show that effective structured features can be attained even without external knowledge.
arXiv Detail & Related papers (2022-04-09T09:44:05Z)
- WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z)
- CoVA: Context-aware Visual Attention for Webpage Information Extraction [65.11609398029783]
We propose to reformulate WIE as a context-aware Webpage Object Detection task.
We develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree.
We show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.
arXiv Detail & Related papers (2021-10-24T00:21:46Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents [16.101638575566444]
FreeDOM is a two-stage neural architecture.
The first stage learns a representation for each DOM node in the page by combining both the text and markup information.
The second stage captures longer range distance and semantic relatedness using a relational neural network.
arXiv Detail & Related papers (2020-10-21T04:20:13Z)
- ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages [66.45377533562417]
We propose a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template.
Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage.
arXiv Detail & Related papers (2020-05-14T16:15:58Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)