Layout-aware Webpage Quality Assessment
- URL: http://arxiv.org/abs/2301.12152v1
- Date: Sat, 28 Jan 2023 10:27:53 GMT
- Title: Layout-aware Webpage Quality Assessment
- Authors: Anfeng Cheng, Yiding Liu, Weibin Li, Qian Dong, Shuaiqiang Wang,
Zhengjie Huang, Shikun Feng, Zhicong Cheng and Dawei Yin
- Abstract summary: We propose a novel layout-aware webpage quality assessment model currently deployed in our search engine.
We employ the meta-data that describes a webpage, i.e., Document Object Model (DOM) tree, as the input of our model.
To assess webpage quality from complex DOM tree data, we propose a graph neural network (GNN) based method.
- Score: 31.537331183733837
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Identifying high-quality webpages is fundamental for real-world search
engines, which can then fulfil users' information needs with less cognitive
burden. Early studies of webpage quality assessment usually design
hand-crafted features that may only work on particular categories of webpages
(e.g., shopping websites, medical websites). They can hardly be applied to
real-world search engines that serve trillions of webpages with various types
and purposes. In this paper, we propose a novel layout-aware webpage quality
assessment model currently deployed in our search engine. Intuitively, layout
is a universal and critical dimension for the quality assessment of different
categories of webpages. Based on this, we directly employ the meta-data that
describes a webpage, i.e., Document Object Model (DOM) tree, as the input of
our model. The DOM tree data unifies the representation of webpages with
different categories and purposes and indicates the layout of webpages. To
assess webpage quality from complex DOM tree data, we propose a graph neural
network (GNN) based method that extracts rich layout-aware information that
implies webpage quality in an end-to-end manner. Moreover, we improve the GNN
method with an attentive readout function, external web categories and a
category-aware sampling method. We conduct rigorous offline and online
experiments to show that our proposed solution is effective in real search
engines, improving the overall usability and user experience.
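The abstract describes using the DOM tree as a graph input to a GNN with an attentive readout. The following is a minimal, untrained Python sketch of that general idea, not the authors' deployed model: it parses HTML into nodes and parent-child edges, applies one round of mean neighbour aggregation, and pools node vectors with softmax attention weights. All names (`DomGraphBuilder`, `node_features`, `mean_aggregate`, `attentive_readout`), the tag vocabulary, and the `query` vector are hypothetical choices made for illustration; in a real model the features and query would be learned.

```python
from html.parser import HTMLParser
import math


class DomGraphBuilder(HTMLParser):
    """Collects DOM elements as nodes and parent-child links as edges.

    Simplified assumption: the HTML is well nested; unclosed non-void
    tags are not repaired.
    """
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.tags = []    # tag name of node i
        self.edges = []   # (parent_index, child_index) pairs
        self._stack = []  # indices of currently open elements

    def handle_starttag(self, tag, attrs):
        idx = len(self.tags)
        self.tags.append(tag)
        if self._stack:
            self.edges.append((self._stack[-1], idx))
        if tag not in self.VOID:
            self._stack.append(idx)

    def handle_endtag(self, tag):
        if self._stack and self.tags[self._stack[-1]] == tag:
            self._stack.pop()


def node_features(tags, vocab=("div", "p", "a", "img")):
    """One-hot tag indicator plus a final 'other tag' slot (toy features)."""
    feats = []
    for t in tags:
        vec = [1.0 if t == v else 0.0 for v in vocab]
        vec.append(0.0 if t in vocab else 1.0)
        feats.append(vec)
    return feats


def mean_aggregate(feats, edges):
    """One message-passing step: average each node with its tree neighbours."""
    neigh = [[i] for i in range(len(feats))]  # include a self loop
    for parent, child in edges:
        neigh[parent].append(child)
        neigh[child].append(parent)
    dim = len(feats[0])
    return [[sum(feats[j][d] for j in ns) / len(ns) for d in range(dim)]
            for ns in neigh]


def attentive_readout(hidden, query):
    """Softmax-weighted sum of node vectors; `query` stands in for learned weights."""
    scores = [sum(h * q for h, q in zip(vec, query)) for vec in hidden]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, hidden))
              for d in range(dim)]
    return pooled, weights


html = ("<html><body><div><p>Hi</p><a href='#'>x</a></div>"
        "<img src='a.png'></body></html>")
builder = DomGraphBuilder()
builder.feed(html)

hidden = mean_aggregate(node_features(builder.tags), builder.edges)
query = [0.5, 1.0, 1.0, -0.5, 0.0]  # hypothetical, untrained attention vector
pooled, weights = attentive_readout(hidden, query)
print(builder.tags, builder.edges)
print(pooled)
```

The pooled vector is a fixed-size page representation that a downstream classifier could score for quality; the attention weights indicate which DOM nodes contributed most under the chosen query.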
Related papers
- IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web [61.96082780724042]
We have curated and aligned a benchmark of images and corresponding web code (IW-Bench).
We propose the Element Accuracy, which tests the completeness of the elements by parsing the Document Object Model (DOM) tree.
We also design a five-hop multimodal Chain-of-Thought Prompting for better performance.
arXiv Detail & Related papers (2024-09-14T05:38:26Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network [19.75890828376791]
We propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN.
It is based on the joint encoding of text and HTML DOM trees in web pages. It performs well on the KI-04 and SWDE datasets and on AHS, a practical dataset from a scholar homepage crawling project.
arXiv Detail & Related papers (2023-05-09T12:19:10Z)
- WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z)
- Learning Context-Aware Representations of Subtrees [0.0]
This thesis tackles the problem of learning efficient representations of complex, structured data with a natural application to web page and element classification.
We hypothesise that the context around an element inside the web page is of high value to the problem and is currently underexploited.
This thesis aims to solve the problem of classifying web elements as subtrees of a DOM tree by also considering their context.
arXiv Detail & Related papers (2021-11-08T07:43:14Z)
- The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models [51.39011092347136]
We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety.
We empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task.
Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page.
Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
arXiv Detail & Related papers (2021-11-03T12:13:52Z)
- CoVA: Context-aware Visual Attention for Webpage Information Extraction [65.11609398029783]
We propose to reformulate WIE as a context-aware Webpage Object Detection task.
We develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree.
We show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.
arXiv Detail & Related papers (2021-10-24T00:21:46Z)
- A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
- Boilerplate Removal using a Neural Sequence Labeling Model [4.056234173482691]
We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input.
This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model.
arXiv Detail & Related papers (2020-04-22T08:06:59Z)
- GIANT: Scalable Creation of a Web-scale Ontology [29.628181324907295]
We argue that existing knowledge bases and categories fail to discover properly grained concepts, events and topics in the language style of the online population.
We present a mechanism to construct a user-centered, web-scale, structured ontology, containing a large number of natural language phrases conforming to user attentions at various granularities.
We present our graph-neural-network-based techniques used in GIANT, and evaluate the proposed methods as compared to a variety of baselines.
arXiv Detail & Related papers (2020-04-05T07:51:23Z)