Related papers: Layout-aware Webpage Quality Assessment

Layout-aware Webpage Quality Assessment

URL: http://arxiv.org/abs/2301.12152v1
Date: Sat, 28 Jan 2023 10:27:53 GMT
Title: Layout-aware Webpage Quality Assessment
Authors: Anfeng Cheng, Yiding Liu, Weibin Li, Qian Dong, Shuaiqiang Wang, Zhengjie Huang, Shikun Feng, Zhicong Cheng and Dawei Yin
Abstract summary: We propose a novel layout-aware webpage quality assessment model currently deployed in our search engine. We employ the meta-data that describes a webpage, i.e., Document Object Model (DOM) tree, as the input of our model. To assess webpage quality from complex DOM tree data, we propose a graph neural network (GNN) based method.
Score: 31.537331183733837
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Identifying high-quality webpages is fundamental for real-world search engines, which can fulfil users' information need with the less cognitive burden. Early studies of \emph{webpage quality assessment} usually design hand-crafted features that may only work on particular categories of webpages (e.g., shopping websites, medical websites). They can hardly be applied to real-world search engines that serve trillions of webpages with various types and purposes. In this paper, we propose a novel layout-aware webpage quality assessment model currently deployed in our search engine. Intuitively, layout is a universal and critical dimension for the quality assessment of different categories of webpages. Based on this, we directly employ the meta-data that describes a webpage, i.e., Document Object Model (DOM) tree, as the input of our model. The DOM tree data unifies the representation of webpages with different categories and purposes and indicates the layout of webpages. To assess webpage quality from complex DOM tree data, we propose a graph neural network (GNN) based method that extracts rich layout-aware information that implies webpage quality in an end-to-end manner. Moreover, we improve the GNN method with an attentive readout function, external web categories and a category-aware sampling method. We conduct rigorous offline and online experiments to show that our proposed solution is effective in real search engines, improving the overall usability and user experience.

Related papers

Document Quality Scoring for Web Crawling [21.06648177468327]
We use neural estimators of semantic quality for static index pruning to assess semantic quality of web pages in crawling prioritisation tasks. Our software contribution consists of a Docker container that computes an effective quality score for a given web page.
arXiv Detail & Related papers (2025-04-15T09:32:57Z)
MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs [50.274447094978996]
Multi-Page Resource-Aware Webpage (MRWeb) generation task transforms UI designs into multi-page, functional web UIs with internal/external navigation, image loading, and backend routing. Our study applies existing methods to the MRWeb problem using a newly curated dataset of 500 websites (300 synthetic, 200 real-world). Specifically, we identify the best metric to evaluate the similarity of the web UI, assess the impact of the resource list on MRWeb generation, analyze MLLM limitations, and evaluate the effectiveness of the MRWeb tool in real-world.
arXiv Detail & Related papers (2024-12-19T15:02:33Z)
IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web [61.96082780724042]
We have curated and aligned a benchmark of images and corresponding web codes (IW-Bench) We propose the Element Accuracy, which tests the completeness of the elements by parsing the Document Object Model (DOM) tree. We also design a five-hop multimodal Chain-of-Thought Prompting for better performance.
arXiv Detail & Related papers (2024-09-14T05:38:26Z)
AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Existing methods, wrappers-based methods suffer from limited adaptability and scalability when faced with a new website. We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs [49.91550773480978]
This paper introduces WebCode2M, a new dataset comprising 2.56 million instances, each containing a design image along with the corresponding webpage code and layout details. To validate the effectiveness of WebCode2M, we introduce a baseline model based on the Vision Transformer (ViT), named WebCoder, and establish a benchmark for fair comparison. The benchmarking results demonstrate that our dataset significantly improves the ability of MLLMs to generate code from webpage designs.
arXiv Detail & Related papers (2024-04-09T15:05:48Z)
PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network [19.75890828376791]
We propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN. It is based on the joint encoding of text and HTML DOM trees in the web pages. It performs well on the KI-04 and SWDE datasets and on practical dataset AHS for the project of scholar's homepage crawling.
arXiv Detail & Related papers (2023-05-09T12:19:10Z)
WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z)
Learning Context-Aware Representations of Subtrees [0.0]
This thesis tackles the problem of learning efficient representations of complex, structured data with a natural application to web page and element classification. We hypothesise that the context around the element inside the web page is of high value to the problem and is currently under exploited. This thesis aims to solve the problem of classifying web elements as subtrees of a DOM tree by also considering their context.
arXiv Detail & Related papers (2021-11-08T07:43:14Z)
The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models [51.39011092347136]
We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety. We empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task. Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page. Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
arXiv Detail & Related papers (2021-11-03T12:13:52Z)
CoVA: Context-aware Visual Attention for Webpage Information Extraction [65.11609398029783]
We propose to reformulate WIE as a context-aware Webpage Object Detection task. We develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree. We show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.
arXiv Detail & Related papers (2021-10-24T00:21:46Z)
A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations. Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
Boilerplate Removal using a Neural Sequence Labeling Model [4.056234173482691]
We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model.
arXiv Detail & Related papers (2020-04-22T08:06:59Z)
GIANT: Scalable Creation of a Web-scale Ontology [29.628181324907295]
We argue that existing knowledge bases and categories fail to discover properly grained concepts, events and topics in the language style of online population. We present a mechanism to construct a user-centered, web-scale, structured ontology, containing a large number of natural language phrases conforming to user attentions at various granularities. We present our graph-neural-network-based techniques used in GIANT, and evaluate the proposed methods as compared to a variety of baselines.
arXiv Detail & Related papers (2020-04-05T07:51:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.