Web Page Classification using LLMs for Crawling Support
- URL: http://arxiv.org/abs/2505.06972v1
- Date: Sun, 11 May 2025 13:07:15 GMT
- Title: Web Page Classification using LLMs for Crawling Support
- Authors: Yuichi Sasazawa, Yasuhiro Sogawa
- Abstract summary: We propose a method to efficiently collect new pages by classifying web pages into two types, "Index Pages" and "Content Pages." We construct a dataset with automatically annotated web page types and evaluate our approach from two perspectives: the page type classification performance and coverage of new pages.
- Score: 3.370788394696053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A web crawler is a system designed to collect web pages, and efficient crawling of new pages requires appropriate algorithms. While website features such as XML sitemaps and the frequency of past page updates provide important clues for accessing new pages, their universal application across diverse conditions is challenging. In this study, we propose a method to efficiently collect new pages by classifying web pages into two types, "Index Pages" and "Content Pages," using a large language model (LLM), and leveraging the classification results to select index pages as starting points for accessing new pages. We construct a dataset with automatically annotated web page types and evaluate our approach from two perspectives: the page type classification performance and coverage of new pages. Experimental results demonstrate that the LLM-based method outperformed baseline methods in both evaluation metrics.
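The selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper uses an LLM as the classifier, whereas `heuristic_classifier` below is a hypothetical link-density stand-in so the example runs without any model. The function names, the labels `"index"` and `"content"`, and both thresholds are assumptions made for this sketch.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List


class _LinkTextCounter(HTMLParser):
    """Counts anchors with an href and visible text characters in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1

    def handle_data(self, data):
        self.text_chars += len(data.strip())


def heuristic_classifier(html: str) -> str:
    """Hypothetical stand-in for the LLM call (thresholds are arbitrary).

    A page with several links and little text per link is labeled "index";
    everything else is labeled "content".
    """
    counter = _LinkTextCounter()
    counter.feed(html)
    chars_per_link = counter.text_chars / max(counter.links, 1)
    if counter.links >= 3 and chars_per_link < 100:
        return "index"
    return "content"


def select_starting_points(
    pages: Dict[str, str],
    classify: Callable[[str], str] = heuristic_classifier,
) -> List[str]:
    """Return the URLs of pages classified as index pages, sorted by URL.

    `pages` maps URL to raw HTML. `classify` can be swapped for a function
    that prompts an LLM and returns "index" or "content".
    """
    return sorted(url for url, html in pages.items() if classify(html) == "index")


if __name__ == "__main__":
    index_html = (
        '<ul><li><a href="/a">Story A</a></li>'
        '<li><a href="/b">Story B</a></li>'
        '<li><a href="/c">Story C</a></li></ul>'
    )
    content_html = (
        "<h1>Story A</h1><p>" + "Long article text. " * 30 + '</p><a href="/">Home</a>'
    )
    pages = {
        "https://example.com/": index_html,
        "https://example.com/a": content_html,
    }
    print(select_starting_points(pages))
```

Passing the classifier in as a callable keeps the selection logic independent of how page types are predicted, so an LLM-backed function can replace the heuristic without changing the crawler-side code.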
Related papers
- WebWalker: Benchmarking LLMs in Web Traversal [64.48425443951749]
We introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. We propose WebWalker, which is a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm.
arXiv Detail & Related papers (2025-01-13T18:58:07Z)
- MRWeb: An Exploration of Generating Multi-Page Resource-Aware Web Code from UI Designs [50.274447094978996]
The Multi-Page Resource-Aware Webpage (MRWeb) generation task transforms UI designs into multi-page, functional web UIs with internal/external navigation, image loading, and backend routing. Our study applies existing methods to the MRWeb problem using a newly curated dataset of 500 websites (300 synthetic, 200 real-world). Specifically, we identify the best metric to evaluate the similarity of the web UI, assess the impact of the resource list on MRWeb generation, analyze MLLM limitations, and evaluate the effectiveness of the MRWeb tool in real-world settings.
arXiv Detail & Related papers (2024-12-19T15:02:33Z)
- Generative Pre-trained Ranking Model with Over-parameterization at Web-Scale (Extended Abstract) [73.57710917145212]
Learning to rank is widely employed in web searches to prioritize pertinent webpages based on input queries.
We propose a Generative Semi-Supervised Pre-trained (GS2P) model to address these challenges.
We conduct extensive offline experiments on both a publicly available dataset and a real-world dataset collected from a large-scale search engine.
arXiv Detail & Related papers (2024-09-25T03:39:14Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- List-aware Reranking-Truncation Joint Model for Search and Retrieval-augmented Generation [80.12531449946655]
We propose a Reranking-Truncation joint model (GenRT) that can perform the two tasks concurrently.
GenRT integrates reranking and truncation via a generative paradigm based on an encoder-decoder architecture.
Our method achieves SOTA performance on both reranking and truncation tasks for web search and retrieval-augmented LLMs.
arXiv Detail & Related papers (2024-02-05T06:52:53Z)
- Context-Aware Classification of Legal Document Pages [7.306025535482021]
We present a simple but effective approach that overcomes the constraint on input length.
Specifically, we enhance the input with extra tokens carrying sequential information about previous pages.
Our experiments conducted on two legal datasets in English and Portuguese respectively show that the proposed approach can significantly improve the performance of document page classification.
arXiv Detail & Related papers (2023-04-05T23:14:58Z)
- Layout-aware Webpage Quality Assessment [31.537331183733837]
We propose a novel layout-aware webpage quality assessment model currently deployed in our search engine.
We employ the meta-data that describes a webpage, i.e., Document Object Model (DOM) tree, as the input of our model.
To assess webpage quality from complex DOM tree data, we propose a graph neural network (GNN) based method.
arXiv Detail & Related papers (2023-01-28T10:27:53Z)
- Page Segmentation using Visual Adjacency Analysis [5.9521013526545925]
We propose a novel page segmentation approach based on visual analysis of localized adjacency regions.
It combines DOM attributes and visual analysis to build features of a given page and guide an unsupervised clustering.
We evaluate our approach on 35 real-world web pages, and examine the effectiveness and efficiency of segmentation.
arXiv Detail & Related papers (2021-12-11T00:20:30Z)
- Prediction of new outlinks for focused Web crawling [0.0]
This work provides a methodology for detecting new links effectively using a short history.
We provide statistical models for three targets: the link change rate, the presence of new links, and the number of new links.
A notable finding is that, if the history of the target page is not available, our new features, which represent the history of related pages, are the most predictive of new links in the target page.
arXiv Detail & Related papers (2021-11-09T11:36:21Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)