WebFormer: The Web-page Transformer for Structure Information Extraction
- URL: http://arxiv.org/abs/2202.00217v1
- Date: Tue, 1 Feb 2022 04:44:02 GMT
- Title: WebFormer: The Web-page Transformer for Structure Information Extraction
- Authors: Qifan Wang, Yi Fang, Anirudh Ravula, Fuli Feng, Xiaojun Quan, Dongfang Liu
- Abstract summary: Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
- Score: 44.46531405460861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Structure information extraction refers to the task of extracting structured
text fields from web pages, such as extracting a product offer from a shopping
page including product title, description, brand and price. It is an important
research topic which has been widely studied in document understanding and web
search. Recent natural language models with sequence modeling have demonstrated
state-of-the-art performance on web information extraction. However,
effectively serializing tokens from unstructured web pages is challenging in
practice due to a variety of web layout patterns. Limited work has focused on
modeling the web layout for extracting the text fields. In this paper, we
introduce WebFormer, a Web-page transFormer model for structure information
extraction from web documents. First, we design HTML tokens for each DOM node
in the HTML by embedding representations from their neighboring tokens through
graph attention. Second, we construct rich attention patterns between HTML
tokens and text tokens, which leverage the web layout for effective attention
weight computation. We conduct an extensive set of experiments on SWDE and
Common Crawl benchmarks. Experimental results demonstrate the superior
performance of the proposed approach over several state-of-the-art methods.
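The abstract names two ingredients: one HTML token per DOM node, connected to its neighbors, and layout-aware attention between HTML tokens and text tokens. The following is a minimal sketch of how such a sparse attention pattern could be represented as a boolean mask. It is not the authors' implementation: the class `DomNode`, the function `build_attention_mask`, and the three connectivity rules (HTML tokens attend along parent-child DOM edges, a text token and its enclosing node attend to each other, text tokens under the same node attend to each other) are assumptions chosen for illustration, since the abstract does not specify the exact patterns.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DomNode:
    """Hypothetical toy DOM node: a tag, child nodes, and its own text tokens."""
    tag: str
    children: list[DomNode] = field(default_factory=list)
    text: list[str] = field(default_factory=list)


def build_attention_mask(root: DomNode) -> tuple[list[str], list[list[bool]]]:
    """Return (labels, mask); mask[i][j] is True if token i may attend to token j.

    The connectivity rules below are illustrative assumptions, not the
    patterns defined in the WebFormer paper.
    """
    labels: list[str] = []                 # all tokens in depth-first order
    html_edges: list[tuple[int, int]] = []  # (parent HTML token, child HTML token)
    owner: dict[int, int] = {}             # text token index -> enclosing HTML token index

    def visit(node: DomNode, parent_idx: int | None) -> None:
        idx = len(labels)
        labels.append(f"<{node.tag}>")      # one HTML token per DOM node
        if parent_idx is not None:
            html_edges.append((parent_idx, idx))
        for tok in node.text:
            owner[len(labels)] = idx
            labels.append(tok)
        for child in node.children:
            visit(child, idx)

    visit(root, None)

    n = len(labels)
    # Every token attends to itself.
    mask = [[i == j for j in range(n)] for i in range(n)]

    # Assumed rule 1: HTML tokens attend along DOM parent-child edges.
    for parent, child in html_edges:
        mask[parent][child] = mask[child][parent] = True

    # Assumed rule 2: a text token and its enclosing HTML token attend to each other.
    for text_idx, html_idx in owner.items():
        mask[text_idx][html_idx] = mask[html_idx][text_idx] = True

    # Assumed rule 3: text tokens under the same DOM node attend to each other.
    for a in owner:
        for b in owner:
            if owner[a] == owner[b]:
                mask[a][b] = True

    return labels, mask


if __name__ == "__main__":
    # Toy shopping-page fragment: a product title and a price.
    page = DomNode("div", children=[
        DomNode("h1", text=["Acme", "Kettle"]),
        DomNode("span", text=["$25"]),
    ])
    labels, mask = build_attention_mask(page)
    for label, row in zip(labels, mask):
        print(f"{label:>8}", "".join("x" if allowed else "." for allowed in row))
```

In a real model such a mask would restrict which attention weights are computed, so that tokens from unrelated parts of the page do not attend to each other directly; a dense matrix is used here only to keep the sketch readable.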
Related papers
- Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding [22.00873805952277]
WebLM is a multimodal pre-training network designed to address the limitations of solely modeling text and structure modalities of HTML in webpages.
We propose several pre-training tasks to model the interaction among text, structure, and image modalities effectively.
Empirical results demonstrate that the pre-trained WebLM significantly surpasses previous state-of-the-art pre-trained models across several webpage understanding tasks.
arXiv Detail & Related papers (2024-02-28T11:50:36Z)
- Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path [28.898240725099782]
We propose a new approach, ReXMiner, for zero-shot relation extraction in web mining.
ReXMiner encodes the shortest relative paths in the Document Object Model (DOM) tree.
It also incorporates the popularity of each text node by counting the occurrence of the same text node across different web pages.
arXiv Detail & Related papers (2023-05-23T08:16:52Z)
- PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network [19.75890828376791]
We propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN.
It is based on the joint encoding of the text and HTML DOM trees in web pages. It performs well on the KI-04 and SWDE datasets and on AHS, a practical dataset from a project for crawling scholars' homepages.
arXiv Detail & Related papers (2023-05-09T12:19:10Z)
- A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data.
We design a novel attention mechanism Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context.
arXiv Detail & Related papers (2023-05-05T16:38:05Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end to end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- DOM-LM: Learning Generalizable Representations for HTML Documents [33.742833774918786]
We introduce a novel representation learning approach for web pages, dubbed DOM-LM, which addresses the limitations of existing approaches.
We evaluate DOM-LM on a variety of webpage understanding tasks, including Attribute Extraction, Open Information Extraction, and Question Answering.
arXiv Detail & Related papers (2022-01-25T20:10:32Z)
- CoVA: Context-aware Visual Attention for Webpage Information Extraction [65.11609398029783]
We propose to reformulate WIE as a context-aware Webpage Object Detection task.
We develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree.
We show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.
arXiv Detail & Related papers (2021-10-24T00:21:46Z)
- A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
- ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages [66.45377533562417]
We propose a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template.
Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage.
arXiv Detail & Related papers (2020-05-14T16:15:58Z)
- Boilerplate Removal using a Neural Sequence Labeling Model [4.056234173482691]
We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input.
This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model.
arXiv Detail & Related papers (2020-04-22T08:06:59Z)