ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages
- URL: http://arxiv.org/abs/2005.07105v1
- Date: Thu, 14 May 2020 16:15:58 GMT
- Title: ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages
- Authors: Colin Lockard, Prashant Shiralkar, Xin Luna Dong, Hannaneh Hajishirzi
- Abstract summary: We propose a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template.
Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many documents, such as semi-structured webpages, textual semantics are
augmented with additional information conveyed using visual elements including
layout, font size, and color. Prior work on information extraction from
semi-structured websites has required learning an extraction model specific to
a given template via either manually labeled or distantly supervised data from
that template. In this work, we propose a solution for "zero-shot" open-domain
relation extraction from webpages with a previously unseen template, including
from websites with little overlap with existing sources of knowledge for
distant supervision and websites in entirely new subject verticals. Our model
uses a graph neural network-based approach to build a rich representation of
text fields on a webpage and the relationships between them, enabling
generalization to new templates. Experiments show this approach provides a 31%
F1 gain over a baseline for zero-shot extraction in a new subject vertical.
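The abstract's core idea, a graph neural network over text fields where edges capture layout relationships, can be illustrated with a minimal message-passing sketch. The field names, feature vectors, and the equal-weight combination rule below are all hypothetical toys, not the paper's actual model:

```python
# Minimal message-passing sketch: each text field on a page is a node
# with a feature vector; edges link visually adjacent fields. One round
# of mean-aggregation mixes each node's features with its neighbors'.

def message_pass(features, edges):
    """One round of mean-neighbor aggregation over an undirected graph."""
    neighbors = {n: [] for n in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    updated = {}
    for node, vec in features.items():
        msgs = [features[m] for m in neighbors[node]] or [vec]
        mean = [sum(vals) / len(msgs) for vals in zip(*msgs)]
        # Combine self features with the aggregated neighbor message.
        updated[node] = [0.5 * s + 0.5 * m for s, m in zip(vec, mean)]
    return updated

# Hypothetical page: a "label" field adjacent to its "value" field.
feats = {"label": [1.0, 0.0], "value": [0.0, 1.0], "title": [1.0, 1.0]}
edges = [("label", "value"), ("title", "label")]
out = message_pass(feats, edges)
```

After one round, a value field's representation reflects the label next to it, which is the intuition behind generalizing across unseen templates: the relationship pattern, not the template, carries the signal.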
Related papers
- Combining Language and Graph Models for Semi-structured Information Extraction on the Web [7.44454462555094]
We present GraphScholarBERT, an open-domain information extraction method based on a joint graph and language model structure.
Experiments show that GraphScholarBERT can improve extraction F1 scores by as much as 34.8% compared to previous work in a zero-shot domain and zero-shot website setting.
arXiv Detail & Related papers (2024-02-21T20:53:29Z)
- Instruct and Extract: Instruction Tuning for On-Demand Information Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users.
We present a benchmark named InstructIE, inclusive of both automatically generated training data, as well as the human-annotated test set.
Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z)
- Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path [28.898240725099782]
We propose a new approach, ReXMiner, for zero-shot relation extraction in web mining.
ReXMiner encodes the shortest relative paths in the Document Object Model (DOM) tree.
It also incorporates the popularity of each text node by counting the occurrence of the same text node across different web pages.
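The relative-path idea can be sketched in a few lines: strip the shared prefix of two nodes' absolute DOM paths, then join the remainders. The tag paths and the `relative_path` helper are hypothetical; ReXMiner learns an encoding of such paths rather than manipulating strings:

```python
# Sketch of a "relative path" between two text nodes in a DOM tree:
# drop the common ancestor prefix, go up from node A, then down to B.

def relative_path(path_a, path_b):
    """Shortest relative path between two nodes given absolute tag paths."""
    i = 0
    while i < min(len(path_a), len(path_b)) and path_a[i] == path_b[i]:
        i += 1
    # Steps up from node A to the common ancestor, then down to node B.
    return ["up"] * (len(path_a) - i) + path_b[i:]

# Hypothetical DOM: a label cell and a value cell in the same table row.
label = ["html", "body", "table", "tr", "td[1]", "span"]
value = ["html", "body", "table", "tr", "td[2]", "b"]
rel = relative_path(label, value)
```

The appeal of relative paths for zero-shot extraction is that the same short path (e.g. sibling cells in a row) recurs across pages generated from one template, even when absolute positions differ.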
arXiv Detail & Related papers (2023-05-23T08:16:52Z)
- A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data.
We design a novel attention mechanism Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context.
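The global-token selection step can be sketched as a simple top-k filter: score every token for relevance and keep the best few as globals that the rest of the page attends to. The tokens, scores, and helper below are toy assumptions, not the paper's scoring mechanism:

```python
# Sketch of selecting "global tokens": keep the k highest-scoring
# tokens (by some relevance score) while preserving document order.

def pick_global_tokens(tokens, scores, k=2):
    """Return the k highest-scoring tokens, in original order."""
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

# Hypothetical page regions with toy relevance scores.
tokens = ["logo", "Title", "nav", "Summary", "footer"]
scores = [0.1, 0.9, 0.2, 0.8, 0.1]
globals_ = pick_global_tokens(tokens, scores)
```

Restricting full attention to a small global set is what keeps such models tractable on long webpage inputs.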
arXiv Detail & Related papers (2023-05-05T16:38:05Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework is end-to-end trainable, allowing global optimization across both tasks.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z)
- GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph [53.70520466556453]
We propose GraphFormers, where layerwise GNN components are nested alongside the transformer blocks of language models.
With the proposed architecture, the text encoding and the graph aggregation are fused into an iterative workflow.
In addition, a progressive learning strategy is introduced: the model is successively trained on manipulated data and then original data to strengthen its ability to integrate information over the graph.
arXiv Detail & Related papers (2021-05-06T12:20:41Z)
- FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents [16.101638575566444]
FreeDOM is a two-stage architecture. The first stage learns a representation for each DOM node in the page by combining both the text and markup information.
The second stage captures longer range distance and semantic relatedness using a relational neural network.
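Combining text and markup signals into one node feature, as in FreeDOM's first stage, can be sketched with toy features. The tag vocabulary, the bag-of-character text features, and the concatenation scheme here are all illustrative assumptions, not FreeDOM's learned embeddings:

```python
# Sketch of a DOM-node feature combining text and markup signals:
# a toy text feature vector concatenated with a tag one-hot vector.

TAGS = ["div", "span", "td", "h1"]  # hypothetical tag vocabulary

def node_features(text, tag):
    """Concatenate toy text features with a one-hot tag encoding."""
    text_feat = [sum(ch.isdigit() for ch in text),  # digit count
                 len(text.split())]                 # word count
    tag_feat = [1.0 if tag == t else 0.0 for t in TAGS]
    return text_feat + tag_feat

# Hypothetical node: a table cell holding a release year.
f = node_features("Released: 1999", "td")
```

Markup features matter because the same text ("1999") means different things in a heading versus a table cell; the tag portion of the vector lets a downstream model distinguish them.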
arXiv Detail & Related papers (2020-10-21T04:20:13Z)
- KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose a knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, few-shot to evaluate its effectiveness.
Under zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
arXiv Detail & Related papers (2020-10-05T19:59:05Z)
- Boilerplate Removal using a Neural Sequence Labeling Model [4.056234173482691]
We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input.
This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model.
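The input representation this abstract describes, a flat sequence of HTML tags and words with no hand-crafted features, can be produced with the standard library alone. The example page and the `TagWordTokenizer` class are hypothetical; the paper's tokenization details may differ:

```python
# Sketch of flattening HTML into the tag/word token sequence a
# boilerplate-removal sequence labeler would consume (stdlib only).
from html.parser import HTMLParser

class TagWordTokenizer(HTMLParser):
    """Flatten HTML into a single sequence of tag and word tokens."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        self.tokens.extend(data.split())

tok = TagWordTokenizer()
tok.feed("<div><p>Main content</p><footer>Ads</footer></div>")
```

A sequence labeler over these tokens would then tag each word as content or boilerplate, which is what makes the in-browser highlighting extension possible.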
arXiv Detail & Related papers (2020-04-22T08:06:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.