PLM-GNN: A Webpage Classification Method based on Joint Pre-trained
Language Model and Graph Neural Network
- URL: http://arxiv.org/abs/2305.05378v1
- Date: Tue, 9 May 2023 12:19:10 GMT
- Title: PLM-GNN: A Webpage Classification Method based on Joint Pre-trained
Language Model and Graph Neural Network
- Authors: Qiwei Lang, Jingbo Zhou, Haoyi Wang, Shiqi Lyu, Rui Zhang
- Abstract summary: We propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN.
It is based on the joint encoding of text and HTML DOM trees in the web pages. It performs well on the KI-04 and SWDE datasets and on practical dataset AHS for the project of scholar's homepage crawling.
- Score: 19.75890828376791
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The number of web pages is growing at an exponential rate, accumulating
massive amounts of data on the web. It is one of the key processes to classify
webpages in web information mining. Some classical methods are based on
manually building features of web pages and training classifiers based on
machine learning or deep learning. However, building features manually requires
specific domain knowledge and usually takes a long time to validate the
validity of features. Considering webpages generated by the combination of text
and HTML Document Object Model(DOM) trees, we propose a representation and
classification method based on a pre-trained language model and graph neural
network, named PLM-GNN. It is based on the joint encoding of text and HTML DOM
trees in the web pages. It performs well on the KI-04 and SWDE datasets and on
practical dataset AHS for the project of scholar's homepage crawling.
Related papers
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z) - Hierarchical Multimodal Pre-training for Visually Rich Webpage
Understanding [22.00873805952277]
WebLM is a multimodal pre-training network designed to address the limitations of solely modeling text and structure modalities of HTML in webpages.
We propose several pre-training tasks to model the interaction among text, structure, and image modalities effectively.
Empirical results demonstrate that the pre-trained WebLM significantly surpasses previous state-of-the-art pre-trained models across several webpage understanding tasks.
arXiv Detail & Related papers (2024-02-28T11:50:36Z) - A Real-World WebAgent with Planning, Long Context Understanding, and
Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites.
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites.
We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z) - Towards Zero-shot Relation Extraction in Web Mining: A Multimodal
Approach with Relative XML Path [28.898240725099782]
We propose a new approach, ReXMiner, for zero-shot relation extraction in web mining.
ReXMiner encodes the shortest relative paths in the Document Object Model (DOM) tree.
It also incorporates the popularity of each text node by counting the occurrence of the same text node across different web pages.
arXiv Detail & Related papers (2023-05-23T08:16:52Z) - WebBrain: Learning to Generate Factually Correct Articles for Queries by
Grounding on Large Web Corpus [61.209202634703104]
We introduce a new NLP task -- generating short factual articles with references for queries by mining supporting evidence from the Web.
The ultimate goal is to generate a fluent, informative, and factually-correct short article for a factual query unseen in Wikipedia.
We construct a large-scale dataset WebBrain-Raw by extracting English Wikipedia articles and their crawlable Wikipedia references.
arXiv Detail & Related papers (2023-04-10T02:55:48Z) - GROWN+UP: A Graph Representation Of a Webpage Network Utilizing
Pre-training [0.2538209532048866]
We introduce an agnostic deep graph neural network feature extractor that can ingest webpage structures, pre-train self-supervised on massive unlabeled data, and fine-tune to arbitrary tasks on webpages effectually.
We show that our pre-trained model achieves state-of-the-art results using multiple datasets on two very different benchmarks: webpage boilerplate removal and genre classification.
arXiv Detail & Related papers (2022-08-03T13:37:27Z) - TIE: Topological Information Enhanced Structural Reading Comprehension
on Web Pages [31.291568831285442]
We propose a Topological Information Enhanced model (TIE) to transform the token-level task into a tag-level task.
TIE integrates Graph Attention Network (GAT) and Pre-trained Language Model (PLM) to leverage the information.
Experimental results demonstrate that our model outperforms strong baselines and achieves both logical structures and spatial structures.
arXiv Detail & Related papers (2022-05-13T03:21:09Z) - WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z) - The Klarna Product Page Dataset: Web Element Nomination with Graph
Neural Networks and Large Language Models [51.39011092347136]
We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety.
We empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task.
Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page.
Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
arXiv Detail & Related papers (2021-11-03T12:13:52Z) - GraphFormers: GNN-nested Transformers for Representation Learning on
Textual Graph [53.70520466556453]
We propose GraphFormers, where layerwise GNN components are nested alongside the transformer blocks of language models.
With the proposed architecture, the text encoding and the graph aggregation are fused into an iterative workflow.
In addition, a progressive learning strategy is introduced, where the model is successively trained on manipulated data and original data to reinforce its capability of integrating information on graph.
arXiv Detail & Related papers (2021-05-06T12:20:41Z) - A Graph Representation of Semi-structured Data for Web Question
Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.