A Scalable Crawling Algorithm Utilizing Noisy Change-Indicating Signals
- URL: http://arxiv.org/abs/2502.02430v3
- Date: Thu, 20 Mar 2025 21:49:15 GMT
- Title: A Scalable Crawling Algorithm Utilizing Noisy Change-Indicating Signals
- Authors: Róbert Busa-Fekete, Julian Zimmert, András György, Linhai Qiu, Tzu-Wei Sung, Hao Shen, Hyomin Choi, Sharmila Subramaniam, Li Xiao
- Abstract summary: We propose a scalable crawling algorithm which (i) uses the noisy side information in an optimal way under mild assumptions; (ii) can be deployed without heavy centralized computation; (iii) is able to crawl web pages at a constant total rate without spikes in the total bandwidth usage over any time interval.
- Score: 35.53487005950327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Web refresh crawling is the problem of keeping a cache of web pages fresh, that is, having the most recent copy available when a page is requested, given the limited bandwidth available to the crawler. Under the assumption that the change and request events for each web page follow independent Poisson processes, the optimal scheduling policy was derived by Azar et al. (2018). In this paper, we study an extension of this problem where side information indicating content changes, such as various types of web pings (for example, signals from sitemaps, content delivery networks, etc.), is available. Incorporating such side information into the crawling policy is challenging because (i) the signals can be noisy, with false positive events and with missing change events; and (ii) the crawler should achieve fair performance over web pages regardless of the quality of the side information, which may differ from web page to web page. We propose a scalable crawling algorithm which (i) uses the noisy side information in an optimal way under mild assumptions; (ii) can be deployed without heavy centralized computation; and (iii) crawls web pages at a constant total rate, without spikes in the total bandwidth usage over any time interval, and automatically adapts to the new optimal solution when the total bandwidth changes, without centralized computation. Experiments clearly demonstrate the versatility of our approach.
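Under the Poisson-process model referenced in the abstract, the long-run probability that the cached copy of a page is fresh, when the page changes at rate Δ and is crawled at rate r, has the closed form r / (r + Δ). The following minimal simulation sketch (names and structure are illustrative, not taken from the paper) checks this by alternating between change and crawl events:

```python
import random

def simulate_freshness(change_rate, crawl_rate, horizon=100000.0, seed=0):
    """Monte-Carlo estimate of the long-run fraction of time a cached page
    is fresh, when page changes and crawls are independent Poisson processes.
    Closed form: crawl_rate / (crawl_rate + change_rate)."""
    rng = random.Random(seed)
    t, fresh_time, fresh = 0.0, 0.0, True
    while t < horizon:
        dt_change = rng.expovariate(change_rate)  # time to next content change
        dt_crawl = rng.expovariate(crawl_rate)    # time to next crawl
        dt = min(dt_change, dt_crawl)
        if fresh:
            fresh_time += dt                      # page was fresh during this interval
        t += dt
        fresh = dt_crawl < dt_change              # a crawl refreshes; a change makes it stale
    return fresh_time / t

est = simulate_freshness(change_rate=1.0, crawl_rate=3.0)
closed_form = 3.0 / (3.0 + 1.0)
```

By memorylessness of the exponential distribution, sampling both waiting times afresh after each event is valid, and the freshness fraction converges to r / (r + Δ).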
Related papers
- Web Page Classification using LLMs for Crawling Support [3.370788394696053]
We propose a method to efficiently collect new pages by classifying web pages into two types, "Index Pages" and "Content Pages". We construct a dataset with automatically annotated web page types and evaluate our approach from two perspectives: page type classification performance and coverage of new pages.
arXiv Detail & Related papers (2025-05-11T13:07:15Z)
- Efficient and Optimal No-Regret Caching under Partial Observation [11.537072761243344]
We study the caching problem in a more restrictive setting where only a fraction of past requests are observed.
We propose a randomized caching policy with sublinear regret based on classic online learning algorithm Follow-the-Perturbed-Leader.
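The Follow-the-Perturbed-Leader idea can be illustrated with a simplified sketch: perturb cumulative request counts once with random noise and always cache the top-scoring files. This assumes full observation of requests and a single fixed perturbation, so it is a generic FTPL illustration, not the paper's partial-observation algorithm:

```python
import random

def ftpl_cache(requests, n_files, cache_size, eta=1.0, seed=0):
    """Sketch of Follow-the-Perturbed-Leader caching: before each request,
    cache the `cache_size` files with the highest perturbed cumulative
    request counts. Returns the number of cache hits."""
    rng = random.Random(seed)
    counts = [0.0] * n_files
    noise = [rng.gauss(0.0, eta) for _ in range(n_files)]  # fixed perturbation
    hits = 0
    for r in requests:
        scores = [c + z for c, z in zip(counts, noise)]
        cache = set(sorted(range(n_files), key=lambda i: -scores[i])[:cache_size])
        if r in cache:
            hits += 1
        counts[r] += 1.0  # full observation assumed for this sketch
    return hits
```

On a skewed request stream the perturbed counts quickly converge to caching the popular files, which is the mechanism behind FTPL's sublinear regret.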
arXiv Detail & Related papers (2025-03-04T16:21:33Z)
- An Online Gradient-Based Caching Policy with Logarithmic Complexity and Regret Guarantees [13.844896723580858]
We introduce a new variant of the gradient-based online caching policy that achieves groundbreaking logarithmic computational complexity.
This advancement allows us to test the policy on large-scale, real-world traces featuring millions of requests and items.
arXiv Detail & Related papers (2024-05-02T13:11:53Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- Attention-Enhanced Prioritized Proximal Policy Optimization for Adaptive Edge Caching [4.2579244769567675]
We introduce a Proximal Policy Optimization (PPO)-based caching strategy that fully considers file attributes like lifetime, size, and priority.
Our method outperforms a recent Deep Reinforcement Learning-based technique.
arXiv Detail & Related papers (2024-02-08T17:17:46Z)
- Optimistic No-regret Algorithms for Discrete Caching [6.182368229968862]
We take a systematic look at the problem of storing whole files in a cache with limited capacity in the context of optimistic learning.
We provide a universal lower bound for prediction-assisted online caching and design a suite of policies with a range of performance-complexity trade-offs.
Our results substantially improve upon all recently proposed online caching policies, which, being unable to exploit the oracle predictions, offer only $O(\sqrt{T})$ regret.
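A toy version of prediction-assisted ("optimistic") caching is easy to sketch: before each round, the predicted next request receives a one-step optimistic bonus on top of its cumulative count, and the top files by this optimistic score are cached. This is a generic illustration of the idea, not one of the paper's policies:

```python
def optimistic_caching(requests, predictions, n_files, cache_size):
    """Sketch of prediction-assisted caching: the predicted next request
    gets an optimistic +1 bonus before the top-`cache_size` files are
    cached. Returns the number of cache hits."""
    counts = [0] * n_files
    hits = 0
    for req, pred in zip(requests, predictions):
        scores = list(counts)
        scores[pred] += 1  # trust the oracle prediction for one step
        cache = set(sorted(range(n_files), key=lambda i: -scores[i])[:cache_size])
        if req in cache:
            hits += 1
        counts[req] += 1
    return hits
```

With accurate predictions the bonus steers the cache toward the next request before it arrives, which is how optimistic policies beat the prediction-free regret bound.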
arXiv Detail & Related papers (2022-08-15T09:18:41Z)
- Intelligent Request Strategy Design in Recommender System [76.90734681369156]
We envision a new learning task of edge intelligence named Intelligent Request Strategy Design (IRSD)
IRSD aims to improve the effectiveness of waterfall RSs by determining the appropriate occasions of request insertion based on users' real-time intention.
We propose a new paradigm of adaptive request insertion strategy named Uplift-based On-edge Smart Request Framework (AdaRequest)
arXiv Detail & Related papers (2022-06-23T16:51:38Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate DL inference workload and increase system throughput, they introduce an approximation error.
We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
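The core mechanism of approximate-key caching can be sketched in a few lines: query keys (e.g. feature vectors) are quantized to a coarse grid, so nearby queries collide on the same cache entry and produce extra, approximate hits. The class below is a hypothetical illustration on top of a plain LRU store, not the paper's system:

```python
from collections import OrderedDict

class ApproxKeyLRU:
    """Sketch of approximate-key caching: numeric keys are quantized to a
    grid so that nearby queries map to the same entry, trading a bounded
    approximation error for extra hits. LRU eviction."""
    def __init__(self, capacity, grid=0.5):
        self.capacity, self.grid = capacity, grid
        self.store = OrderedDict()

    def _key(self, x):
        # quantize each coordinate; grid controls the approximation error
        return tuple(round(v / self.grid) for v in x)

    def get(self, x):
        k = self._key(x)
        if k in self.store:
            self.store.move_to_end(k)  # mark as most recently used
            return self.store[k]
        return None

    def put(self, x, value):
        k = self._key(x)
        self.store[k] = value
        self.store.move_to_end(k)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
```

The grid size is the knob the paper's error analysis would control: a coarser grid means more approximate hits but a larger worst-case error.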
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
- Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling [60.48359567964899]
We consider the problem of scheduling in constrained queueing networks with a view to minimizing packet delay.
We use a policy gradient based reinforcement learning algorithm that produces a scheduler that performs better than the available atomic policies.
arXiv Detail & Related papers (2021-05-01T10:18:34Z)
- No-Regret Caching via Online Mirror Descent [0.0]
We study an online caching problem in which requests can be served by a local cache to avoid retrieval costs from a remote server.
We show that bounds for the regret crucially depend on the diversity of the request process, provided by the diversity ratio R/h.
We also prove that, when the cache must store the entire file, rather than a fraction, OMD strategies can be coupled with a randomized rounding scheme that preserves regret guarantees.
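The randomized rounding step can be illustrated with the standard systematic-sampling scheme (a generic sketch, not necessarily the exact scheme coupled with OMD in the paper): given fractional cache occupancies y_i in [0, 1] summing to an integer k, it returns exactly k files while preserving every marginal probability y_i.

```python
import math

def systematic_round(y, u):
    """Given fractional occupancies y (summing to an integer k) and a
    single uniform draw u in [0, 1), return exactly k indices such that
    index i is selected with probability y[i]."""
    chosen, c = [], 0.0
    for i, yi in enumerate(y):
        lo, c = c, c + yi
        # index i is picked iff some grid point u + j (integer j)
        # falls inside its interval [lo, c)
        if math.floor(c - u) > math.floor(lo - u):
            chosen.append(i)
    return chosen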
arXiv Detail & Related papers (2021-01-29T13:56:51Z)
- Online Algorithms for Estimating Change Rates of Web Pages [2.4923006485141284]
Finite bandwidth availability and server restrictions limit how frequently different pages can be crawled.
Existing approaches either assume knowledge of the exact page change rates or use inefficient methods such as MLE to estimate them.
We provide three novel schemes for online estimation of page change rates, all of which have extremely low running times.
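For context, the MLE-style baseline that such schemes improve upon can be sketched as follows: with equally spaced crawls at interval tau, each crawl only reveals whether the page changed since the last visit, a change is seen with probability 1 - exp(-rate * tau), and inverting this gives a rate estimate. Names and parameters here are illustrative:

```python
import math
import random

def estimate_change_rate(observations, tau):
    """MLE-style estimate of a page's Poisson change rate from binary crawl
    outcomes taken every `tau` time units: a change is observed with
    probability 1 - exp(-rate * tau), so rate ~= -ln(1 - p_hat) / tau."""
    changed = 0
    for n, saw_change in enumerate(observations, start=1):
        changed += saw_change
        p_hat = min(changed / n, 1.0 - 1e-9)     # guard against log(0)
        rate_hat = -math.log(1.0 - p_hat) / tau  # refreshed after every crawl
    return rate_hat

# synthetic crawl outcomes for a page whose true change rate is 0.7
rng = random.Random(1)
tau, true_rate = 1.0, 0.7
p = 1.0 - math.exp(-true_rate * tau)
outcomes = [rng.random() < p for _ in range(20000)]
est_rate = estimate_change_rate(outcomes, tau)
```

Note the estimator only recovers the rate up to the censoring inherent in binary observations: multiple changes between two crawls count as one, which is exactly what the exponential inversion accounts for.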
arXiv Detail & Related papers (2020-09-17T08:25:02Z)
- Reinforcement Learning for Caching with Space-Time Popularity Dynamics [61.55827760294755]
Caching is envisioned to play a critical role in next-generation networks.
To intelligently prefetch and store contents, a cache node should be able to learn what and when to cache.
This chapter presents a versatile reinforcement learning based approach for near-optimal caching policy design.
arXiv Detail & Related papers (2020-05-19T01:23:51Z)
- Change Rate Estimation and Optimal Freshness in Web Page Crawling [2.4923006485141284]
Finite bandwidth availability and server restrictions impose some constraints on the crawling frequency.
The ideal crawling rates are the ones that maximise the freshness of the local cache.
We provide two novel schemes for online estimation of page change rates.
arXiv Detail & Related papers (2020-04-05T11:48:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.