Towards Zero-shot Relation Extraction in Web Mining: A Multimodal
Approach with Relative XML Path
- URL: http://arxiv.org/abs/2305.13805v1
- Date: Tue, 23 May 2023 08:16:52 GMT
- Title: Towards Zero-shot Relation Extraction in Web Mining: A Multimodal
Approach with Relative XML Path
- Authors: Zilong Wang, Jingbo Shang
- Abstract summary: We propose a new approach, ReXMiner, for zero-shot relation extraction in web mining.
ReXMiner encodes the shortest relative paths in the Document Object Model (DOM) tree.
It also incorporates the popularity of each text node by counting the occurrence of the same text node across different web pages.
- Score: 28.898240725099782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid growth of web pages and the increasing complexity of their
structure poses a challenge for web mining models. Web mining models are
required to understand the semi-structured web pages, particularly when little
is known about the subject or template of a new page. Current methods migrate
language models to the web mining by embedding the XML source code into the
transformer or encoding the rendered layout with graph neural networks.
However, these approaches do not take into account the relationships between
text nodes within and across pages. In this paper, we propose a new approach,
ReXMiner, for zero-shot relation extraction in web mining. ReXMiner encodes the
shortest relative paths in the Document Object Model (DOM) tree which is a more
accurate and efficient signal for key-value pair extraction within a web page.
It also incorporates the popularity of each text node by counting the
occurrence of the same text node across different web pages. We use the
contrastive learning to address the issue of sparsity in relation extraction.
Extensive experiments on public benchmarks show that our method, ReXMiner,
outperforms the state-of-the-art baselines in the task of zero-shot relation
extraction in web mining.
Related papers
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing methods, wrappers-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z) - PLM-GNN: A Webpage Classification Method based on Joint Pre-trained
Language Model and Graph Neural Network [19.75890828376791]
We propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN.
It is based on the joint encoding of text and HTML DOM trees in the web pages. It performs well on the KI-04 and SWDE datasets and on practical dataset AHS for the project of scholar's homepage crawling.
arXiv Detail & Related papers (2023-05-09T12:19:10Z) - A Suite of Generative Tasks for Multi-Level Multimodal Webpage
Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data.
We design a novel attention mechanism Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context.
arXiv Detail & Related papers (2023-05-05T16:38:05Z) - Web Page Content Extraction Based on Multi-feature Fusion [20.214440785390984]
This paper proposes a web page text extraction algorithm based on multi-feature fusion.
It takes multiple features of DOM nodes as input, predicts whether the nodes contain text information, and adapts to more types of pages.
Experimental results show that this method has a good ability of web page text extraction and avoids the problem of manually determining the threshold.
arXiv Detail & Related papers (2022-03-21T04:26:51Z) - WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z) - Modeling Graph Structure via Relative Position for Text Generation from
Knowledge Graphs [54.176285420428776]
We present Graformer, a novel Transformer-based encoder-decoder architecture for graph-to-text generation.
With our novel graph self-attention, the encoding of a node relies on all nodes in the input graph - not only direct neighbors - facilitating the detection of global patterns.
Graformer learns to weight these node-node relations differently for different attention heads, thus virtually learning differently connected views of the input graph.
arXiv Detail & Related papers (2020-06-16T15:20:04Z) - ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured
Webpages [66.45377533562417]
We propose a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template.
Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage.
arXiv Detail & Related papers (2020-05-14T16:15:58Z) - Heterogeneous Graph Neural Networks for Extractive Document
Summarization [101.17980994606836]
Cross-sentence relations are a crucial step in extractive document summarization.
We present a graph-based neural network for extractive summarization (HeterSumGraph)
We introduce different types of nodes into graph-based neural networks for extractive document summarization.
arXiv Detail & Related papers (2020-04-26T14:38:11Z) - Boilerplate Removal using a Neural Sequence Labeling Model [4.056234173482691]
We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input.
This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model.
arXiv Detail & Related papers (2020-04-22T08:06:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.