Related papers: PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network

PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network

URL: http://arxiv.org/abs/2305.05378v1
Date: Tue, 9 May 2023 12:19:10 GMT
Title: PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network
Authors: Qiwei Lang, Jingbo Zhou, Haoyi Wang, Shiqi Lyu, Rui Zhang
Abstract summary: We propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN. It is based on the joint encoding of text and HTML DOM trees in the web pages. It performs well on the KI-04 and SWDE datasets and on practical dataset AHS for the project of scholar's homepage crawling.
Score: 19.75890828376791
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The number of web pages is growing at an exponential rate, accumulating massive amounts of data on the web. It is one of the key processes to classify webpages in web information mining. Some classical methods are based on manually building features of web pages and training classifiers based on machine learning or deep learning. However, building features manually requires specific domain knowledge and usually takes a long time to validate the validity of features. Considering webpages generated by the combination of text and HTML Document Object Model(DOM) trees, we propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN. It is based on the joint encoding of text and HTML DOM trees in the web pages. It performs well on the KI-04 and SWDE datasets and on practical dataset AHS for the project of scholar's homepage crawling.

Related papers

Organize the Web: Constructing Domains Enhances Pre-Training Data Curation [129.27104172458363]
We develop a framework for organizing web pages in terms of both their topic and format. We automatically annotate pre-training data by distilling annotations from a large language model into efficient curations. Our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods.
arXiv Detail & Related papers (2025-02-14T18:02:37Z)
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio. Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. We propose a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding [22.00873805952277]
WebLM is a multimodal pre-training network designed to address the limitations of solely modeling text and structure modalities of HTML in webpages. We propose several pre-training tasks to model the interaction among text, structure, and image modalities effectively. Empirical results demonstrate that the pre-trained WebLM significantly surpasses previous state-of-the-art pre-trained models across several webpage understanding tasks.
arXiv Detail & Related papers (2024-02-28T11:50:36Z)
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis [69.15016747150868]
We introduce WebAgent, an agent that learns from self-experience to complete tasks on real websites. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks.
arXiv Detail & Related papers (2023-07-24T14:56:30Z)
Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path [28.898240725099782]
We propose a new approach, ReXMiner, for zero-shot relation extraction in web mining. ReXMiner encodes the shortest relative paths in the Document Object Model (DOM) tree. It also incorporates the popularity of each text node by counting the occurrence of the same text node across different web pages.
arXiv Detail & Related papers (2023-05-23T08:16:52Z)
WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus [61.209202634703104]
We introduce a new NLP task -- generating short factual articles with references for queries by mining supporting evidence from the Web. The ultimate goal is to generate a fluent, informative, and factually-correct short article for a factual query unseen in Wikipedia. We construct a large-scale dataset WebBrain-Raw by extracting English Wikipedia articles and their crawlable Wikipedia references.
arXiv Detail & Related papers (2023-04-10T02:55:48Z)
GROWN+UP: A Graph Representation Of a Webpage Network Utilizing Pre-training [0.2538209532048866]
We introduce an agnostic deep graph neural network feature extractor that can ingest webpage structures, pre-train self-supervised on massive unlabeled data, and fine-tune to arbitrary tasks on webpages effectually. We show that our pre-trained model achieves state-of-the-art results using multiple datasets on two very different benchmarks: webpage boilerplate removal and genre classification.
arXiv Detail & Related papers (2022-08-03T13:37:27Z)
TIE: Topological Information Enhanced Structural Reading Comprehension on Web Pages [31.291568831285442]
We propose a Topological Information Enhanced model (TIE) to transform the token-level task into a tag-level task. TIE integrates Graph Attention Network (GAT) and Pre-trained Language Model (PLM) to leverage the information. Experimental results demonstrate that our model outperforms strong baselines and achieves both logical structures and spatial structures.
arXiv Detail & Related papers (2022-05-13T03:21:09Z)
WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z)
The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models [51.39011092347136]
We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety. We empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task. Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page. Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
arXiv Detail & Related papers (2021-11-03T12:13:52Z)
GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph [53.70520466556453]
We propose GraphFormers, where layerwise GNN components are nested alongside the transformer blocks of language models. With the proposed architecture, the text encoding and the graph aggregation are fused into an iterative workflow. In addition, a progressive learning strategy is introduced, where the model is successively trained on manipulated data and original data to reinforce its capability of integrating information on graph.
arXiv Detail & Related papers (2021-05-06T12:20:41Z)
A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations. Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.