FreeDOM: A Transferable Neural Architecture for Structured Information
Extraction on Web Documents
- URL: http://arxiv.org/abs/2010.10755v1
- Date: Wed, 21 Oct 2020 04:20:13 GMT
- Authors: Bill Yuchen Lin, Ying Sheng, Nguyen Vo, Sandeep Tata
- Abstract summary: FreeDOM is a two-stage approach. The first stage learns a representation for each DOM node in the page by combining both the text and markup information; the second stage captures longer-range distance and semantic relatedness using a relational neural network.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extracting structured data from HTML documents is a long-studied problem with
a broad range of applications like augmenting knowledge bases, supporting
faceted search, and providing domain-specific experiences for key verticals
like shopping and movies. Previous approaches have either required a small
number of examples for each target site or relied on carefully handcrafted
heuristics built over visual renderings of websites. In this paper, we present
a novel two-stage neural approach, named FreeDOM, which overcomes both these
limitations. The first stage learns a representation for each DOM node in the
page by combining both the text and markup information. The second stage
captures longer range distance and semantic relatedness using a relational
neural network. By combining these stages, FreeDOM is able to generalize to
unseen sites after training on a small number of seed sites from that vertical
without requiring expensive hand-crafted features over visual renderings of the
page. Through experiments on a public dataset with 8 different verticals, we
show that FreeDOM beats the previous state of the art by nearly 3.7 F1 points
on average without requiring features over rendered pages or expensive
hand-crafted features.
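The two-stage design described in the abstract can be sketched roughly as follows. This is a toy illustration only: all names and features here are hypothetical stand-ins for the learned neural components in the paper (stage one combines text and markup signals per node; stage two scores node pairs with relational, longer-range features).

```python
# Toy sketch of FreeDOM's two-stage idea. The real model learns these
# encoders end to end; the hand-written features below are placeholders.
from dataclasses import dataclass

@dataclass
class DomNode:
    node_id: int   # position of the node in document order
    text: str      # text content of the node
    tag: str       # markup tag, e.g. "h1" or "span"
    depth: int     # depth of the node in the DOM tree

def encode_node(node: DomNode) -> list[float]:
    """Stage 1 (toy): combine text and markup signals into one vector.
    FreeDOM learns this representation jointly; here we just concatenate
    crude text features with crude markup features."""
    text_feats = [float(len(node.text)),
                  float(sum(c.isdigit() for c in node.text))]
    markup_feats = [float(hash(node.tag) % 97), float(node.depth)]
    return text_feats + markup_feats

def relate(a: DomNode, b: DomNode) -> list[float]:
    """Stage 2 (toy): pairwise features capturing longer-range structure.
    The paper scores node pairs with a relational neural network over
    signals like these (node encodings plus distance)."""
    dist = abs(a.node_id - b.node_id)  # distance in document order
    return encode_node(a) + encode_node(b) + [float(dist)]

# Two hypothetical nodes from a product page
title = DomNode(0, "FreeDOM", "h1", 1)
price = DomNode(5, "$9.99", "span", 3)
pair_vec = relate(title, price)
```

In the actual model, the pairwise scores feed a classifier that labels nodes with the target attributes for the vertical; the point of the sketch is only the separation into a per-node encoder and a relational pair scorer.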
Related papers
- PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network (2023-05-09)
  We propose PLM-GNN, a webpage representation and classification method based on a pre-trained language model and a graph neural network.
  It jointly encodes the text and HTML DOM trees of web pages, and performs well on the KI-04 and SWDE datasets as well as on AHS, a practical dataset from a scholar-homepage crawling project.
- A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding (2023-05-05)
  We introduce the Wikipedia Webpage suite (WikiWeb2M), containing 2M pages with all of the associated image, text, and structure data.
  We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens that attend to the rest of the webpage for context.
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding (2022-10-07)
  We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
  We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents (2022-07-14)
  This paper proposes a unified end-to-end information extraction framework for visually rich documents.
  Text reading and information extraction reinforce each other via a well-designed multi-modal context block, and the framework can be trained end to end, achieving global optimization.
- CoVA: Context-aware Visual Attention for Webpage Information Extraction (2021-10-24)
  We propose to reformulate webpage information extraction (WIE) as a context-aware webpage object detection task.
  We develop a Context-aware Visual Attention-based (CoVA) detection pipeline that combines appearance features with syntactic structure from the DOM tree, and show that it improves upon prior state-of-the-art methods.
- Simplified DOM Trees for Transferable Attribute Extraction from the Web (2021-01-07)
  Given a web page, extracting a structured object along with various attributes of interest can facilitate a variety of downstream applications. Existing approaches formulate the problem as a DOM tree node tagging task.
  We propose a novel transferable method, SimpDOM, that tackles the problem by efficiently retrieving useful context for each node.
- Abstractive Summarization of Spoken and Written Instructions with BERT (2020-08-21)
  We present the first application of the BERTSum model to conversational language, generating abstractive summaries of narrated instructional videos across a wide variety of topics.
  We envision this being integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content on request.
- ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages (2020-05-14)
  We propose a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template.
  Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage.
- Boilerplate Removal using a Neural Sequence Labeling Model (2020-04-22)
  We propose a neural sequence labeling model that relies on no hand-crafted features, taking only the HTML tags and words that appear in a web page as input.
  This allows us to present a browser extension that highlights the content of arbitrary web pages directly within the browser using our model.