Document Quality Scoring for Web Crawling
- URL: http://arxiv.org/abs/2504.11011v1
- Date: Tue, 15 Apr 2025 09:32:57 GMT
- Title: Document Quality Scoring for Web Crawling
- Authors: Francesca Pezzuti, Ariane Mueller, Sean MacAvaney, Nicola Tonellotto
- Abstract summary: We apply neural estimators of semantic quality, originally proposed for static index pruning, to assess the semantic quality of web pages in crawling prioritisation tasks. Our software contribution is a Docker container that computes an effective quality score for a given web page.
- Score: 21.06648177468327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The internet contains large amounts of low-quality content, yet users expect web search engines to deliver high-quality, relevant results. The abundant presence of low-quality pages can negatively impact retrieval and crawling processes by wasting resources on these documents. Therefore, search engines can greatly benefit from techniques that leverage efficient quality estimation methods to mitigate these negative impacts. Quality scoring methods for web pages are useful for many processes typical for web search systems, including static index pruning, index tiering, and crawling. Building on work by Chang et al. (2024), who proposed using neural estimators of semantic quality for static index pruning, we extend their approach and apply their neural quality scorers to assess the semantic quality of web pages in crawling prioritisation tasks. In our experimental analysis, we found that prioritising semantically high-quality pages over low-quality ones can improve downstream search effectiveness. Our software contribution consists of a Docker container that computes an effective quality score for a given web page, allowing the quality scorer to be easily included and used in other components of web search systems.
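To make the crawling-prioritisation idea concrete, here is a minimal sketch of a quality-ordered crawl frontier. The HTTP endpoint, route, and JSON schema of the scorer service are assumptions made for illustration; the actual interface of the paper's Docker container may differ.

```python
import heapq
import requests  # any HTTP client works; requests is assumed available

# Hypothetical endpoint; the real container's port and route may differ.
SCORER_URL = "http://localhost:8000/score"

def quality_score(page_text: str) -> float:
    """Ask the containerized scorer for a quality score (higher = better).
    The request/response schema here is an assumption for illustration."""
    response = requests.post(SCORER_URL, json={"text": page_text})
    response.raise_for_status()
    return float(response.json()["score"])

class CrawlFrontier:
    """Max-priority frontier: semantically high-quality pages are fetched first."""
    def __init__(self):
        self._heap = []    # heapq is a min-heap, so scores are negated
        self._counter = 0  # tie-breaker to keep ordering stable

    def push(self, url: str, score: float) -> None:
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        _, _, url = heapq.heappop(self._heap)
        return url
```

In use, a newly discovered outlink could be pushed with its parent page's score as a cheap prior until its own content has been fetched and scored.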
Related papers
- Migrating a Job Search Relevance Function [0.0]
We describe the migration of a homebrewed C++ search engine to OpenSearch, aimed at preserving and improving search performance with minimal impact on business metrics.
We froze our job corpus and executed queries in low-inventory locations to capture a representative mixture of high- and low-quality search results.
We fine-tuned a new retrieval algorithm on OpenSearch, replicating key components of the original engine's logic while introducing new functionality where necessary.
arXiv Detail & Related papers (2025-04-02T01:22:55Z)
- Multi-Facet Counterfactual Learning for Content Quality Evaluation [48.73583736357489]
We propose a framework for efficiently constructing evaluators that perceive multiple facets of content quality.
We leverage a joint training strategy based on contrastive learning and supervised learning to enable the evaluator to distinguish between different quality facets.
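As a rough illustration of such a joint objective (a sketch, not the paper's actual implementation), a supervised classification loss can be combined with a contrastive term that pulls same-facet representations together. The weighting and temperature below are illustrative choices.

```python
import torch
import torch.nn.functional as F

def joint_loss(embeddings, logits, facet_labels, quality_labels,
               temperature=0.1, alpha=0.5):
    """Hypothetical joint objective: supervised cross-entropy plus a simple
    supervised-contrastive term over facet labels."""
    # Supervised part: predict the quality label directly.
    ce = F.cross_entropy(logits, quality_labels)

    # Contrastive part: embeddings with the same facet label are positives.
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                         # pairwise similarities
    mask = facet_labels.unsqueeze(0) == facet_labels.unsqueeze(1)
    mask.fill_diagonal_(False)                            # exclude self-pairs
    exp_sim = torch.exp(sim)
    exp_sim = exp_sim - torch.diag(torch.diag(exp_sim))   # drop self-similarity
    pos = (exp_sim * mask).sum(dim=1)
    denom = exp_sim.sum(dim=1)
    contrastive = -torch.log((pos + 1e-8) / (denom + 1e-8)).mean()

    return ce + alpha * contrastive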
arXiv Detail & Related papers (2024-10-10T08:04:10Z)
- Neural Passage Quality Estimation for Static Pruning [23.662724916799004]
We explore whether neural networks can effectively predict which of a document's passages are unlikely to be relevant to any query submitted to the search engine.
We find that our novel methods for estimating passage quality allow passage corpora to be pruned considerably.
This work sets the stage for developing more advanced neural "learning-what-to-index" methods.
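In spirit, static pruning with such a quality estimator reduces to dropping the lowest-scoring passages before indexing. A minimal sketch follows; the scoring function and pruning level are stand-ins, not values from the paper.

```python
def prune_passages(passages, score_fn, keep_fraction=0.7):
    """Keep only the top-scoring fraction of passages; the rest are never
    indexed. `score_fn` stands in for a trained neural quality estimator."""
    scored = sorted(passages, key=score_fn, reverse=True)
    cutoff = max(1, int(len(scored) * keep_fraction))
    return scored[:cutoff]

# Example: score by length as a trivial stand-in for a neural estimator.
passages = ["short", "a somewhat longer passage", "the longest passage of all"]
kept = prune_passages(passages, score_fn=len, keep_fraction=0.67)
```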
arXiv Detail & Related papers (2024-07-16T20:47:54Z)
- Tree Search for Language Model Agents [69.43007235771383]
We propose an inference-time search algorithm for LM agents to perform exploration and multi-step planning in interactive web environments.
Our approach is a form of best-first tree search that operates within the actual environment space.
It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks.
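As a rough illustration of the best-first search pattern this entry describes (a generic sketch, not the paper's implementation), `value_fn` stands in for the learned model that scores states, and `budget` caps interactions with the environment.

```python
import heapq

def best_first_search(start_state, actions_fn, transition_fn, value_fn,
                      is_goal_fn, budget=50):
    """Generic best-first tree search over an interactive environment."""
    counter = 0
    frontier = [(-value_fn(start_state), counter, start_state, [])]
    while frontier and budget > 0:
        _, _, state, path = heapq.heappop(frontier)
        if is_goal_fn(state):
            return path
        for action in actions_fn(state):
            next_state = transition_fn(state, action)  # acts in the real env
            budget -= 1
            counter += 1
            heapq.heappush(
                frontier,
                (-value_fn(next_state), counter, next_state, path + [action]))
    return None  # budget exhausted without reaching a goal state
```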
arXiv Detail & Related papers (2024-07-01T17:07:55Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with new websites.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
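To gesture at the idea of LLM-generated scrapers (this is a one-shot sketch, not AutoScraper's two-stage procedure), one can ask a model for a CSS selector and apply it with a standard HTML parser. The `llm` callable and prompt format are hypothetical.

```python
from bs4 import BeautifulSoup  # assumed available

def generate_selector(llm, html_snippet, field_description):
    """Ask an LLM (a stand-in callable: prompt -> text) to propose a CSS
    selector for the target field; the prompt wording is illustrative."""
    prompt = (f"Given this HTML:\n{html_snippet}\n"
              f"Return only a CSS selector for: {field_description}")
    return llm(prompt).strip()

def scrape(html, selector):
    """Apply a (possibly LLM-generated) selector and return extracted text."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(selector)]
```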
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- Comparative analysis of various web crawler algorithms [0.0]
This presentation focuses on the importance of web crawling and page ranking algorithms in dealing with the massive amount of data present on the World Wide Web.
Web crawling is a process that converts unstructured data into structured data, enabling effective information retrieval.
Page ranking algorithms play a significant role in assessing the quality and popularity of web pages.
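For reference, the classic page-ranking computation the entry alludes to is the PageRank power iteration; here is a textbook sketch, not any specific crawler's implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over an adjacency dict
    {page: [outlinked pages]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank uniformly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Example: a tiny three-page web.
print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```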
arXiv Detail & Related papers (2023-06-21T05:27:08Z)
- Layout-aware Webpage Quality Assessment [31.537331183733837]
We propose a novel layout-aware webpage quality assessment model currently deployed in our search engine.
We employ the meta-data that describes a webpage, i.e., its Document Object Model (DOM) tree, as the input to our model.
To assess webpage quality from complex DOM tree data, we propose a graph neural network (GNN) based method.
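A bare-bones sketch of message passing over a DOM graph follows: each layer averages neighbour features and applies a learned transform, and the mean-pooled page representation is mapped to a scalar quality score. The shapes, shared weights, and two-layer design are illustrative, not the paper's model.

```python
import numpy as np

def gnn_score_dom(node_features, adjacency, w1, w2, layers=2):
    """node_features: (n, d) array of per-DOM-node features;
    adjacency: (n, n) 0/1 matrix of parent-child edges;
    w1: (d, d) transform shared across layers for brevity; w2: (d,) readout."""
    a = adjacency + np.eye(adjacency.shape[0])   # add self-loops
    a = a / a.sum(axis=1, keepdims=True)         # row-normalise (mean aggregation)
    h = node_features
    for _ in range(layers):
        h = np.maximum(a @ h @ w1, 0.0)          # aggregate + transform + ReLU
    pooled = h.mean(axis=0)                      # whole-page representation
    return float(pooled @ w2)                    # scalar quality score
```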
arXiv Detail & Related papers (2023-01-28T10:27:53Z)
- Exposing Query Identification for Search Transparency [69.06545074617685]
We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems.
We derive an evaluation metric to measure the quality of a ranking of exposing queries, and conduct an empirical analysis focusing on various practical aspects of approximate EQI.
arXiv Detail & Related papers (2021-10-14T20:19:27Z)
- Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study [86.62171568318716]
Large generative language models such as GPT-2 are well-known for their ability to generate text.
We show that unsupervised predictors of "page quality" emerge, able to detect low-quality content without any training.
We conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
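One simple way to illustrate the general idea of a generative model as an unsupervised quality signal (not the paper's exact scoring procedure) is to treat a language model's perplexity as a fluency proxy: lower perplexity roughly corresponds to more fluent text.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2; a crude unsupervised quality proxy."""
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        # When labels == input_ids, the model returns the mean LM loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return float(torch.exp(loss))
```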
arXiv Detail & Related papers (2020-08-17T07:13:24Z)
- Mining Implicit Relevance Feedback from User Behavior for Web Question Answering [92.45607094299181]
We make the first study to explore the correlation between user behavior and passage relevance.
Our approach significantly improves the accuracy of passage ranking without extra human labeled data.
In practice, this work has proved effective to substantially reduce the human labeling cost for the QA service in a global commercial search engine.
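As a simple illustrative baseline for mining implicit feedback (the paper models richer behavior signals than plain click-through rate), raw logs can be aggregated into weak relevance labels per (query, passage) pair.

```python
from collections import defaultdict

def implicit_relevance_labels(click_logs, min_impressions=10):
    """Aggregate (query, passage, clicked) log rows into CTR-based
    weak relevance labels; low-traffic pairs are filtered as noise."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for query, passage, clicked in click_logs:
        impressions[(query, passage)] += 1
        clicks[(query, passage)] += int(clicked)
    return {
        pair: clicks[pair] / impressions[pair]
        for pair in impressions
        if impressions[pair] >= min_impressions
    }
```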
arXiv Detail & Related papers (2020-06-13T07:02:08Z)
- Web Document Categorization Using Naive Bayes Classifier and Latent Semantic Analysis [0.7310043452300736]
The rapid growth of web documents necessitates efficient techniques for classifying documents on the web.
We propose a method for web document classification that uses LSA to increase similarity of documents under the same class and improve the classification precision.
Experimental results show that this preprocessing can effectively improve the accuracy and speed of the Naive Bayes classifier.
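A minimal LSA + Naive Bayes sketch in the spirit of this entry: TF-IDF features are projected onto latent topics with truncated SVD (LSA), then a Naive Bayes classifier is trained on the reduced representation. The component count and the Gaussian variant are illustrative choices, not the paper's.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import GaussianNB

docs = ["jobs and careers page", "latest sports results", "stock market news"]
labels = ["careers", "sports", "finance"]

pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),  # LSA: low-rank topic space
    GaussianNB(),                  # handles the dense, signed SVD features
)
pipeline.fit(docs, labels)
print(pipeline.predict(["championship final score"]))
```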
arXiv Detail & Related papers (2020-06-02T15:35:05Z)