Change Rate Estimation and Optimal Freshness in Web Page Crawling
- URL: http://arxiv.org/abs/2004.02167v1
- Date: Sun, 5 Apr 2020 11:48:38 GMT
- Title: Change Rate Estimation and Optimal Freshness in Web Page Crawling
- Authors: Konstantin Avrachenkov, Kishor Patil, Gugan Thoppe
- Abstract summary: Finite bandwidth availability and server restrictions impose constraints on the crawling frequency.
The ideal crawling rates are the ones that maximise the freshness of the local cache.
We provide two novel schemes for online estimation of page change rates.
- Score: 2.4923006485141284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For providing quick and accurate results, a search engine maintains a local
snapshot of the entire web. And, to keep this local cache fresh, it employs a
crawler for tracking changes across various web pages. However, finite
bandwidth availability and server restrictions impose some constraints on the
crawling frequency. Consequently, the ideal crawling rates are the ones that
maximise the freshness of the local cache and also respect the above
constraints. Azar et al. (2018) recently proposed a tractable algorithm to solve
this optimisation problem. However, they assume the knowledge of the exact page
change rates, which is unrealistic in practice. We address this issue here.
Specifically, we provide two novel schemes for online estimation of page change
rates. Both schemes only need partial information about the page change
process, i.e., they only need to know if the page has changed or not since the
last crawled instance. For both these schemes, we prove convergence and, also,
derive their convergence rates. Finally, we provide some numerical experiments
to compare the performance of our proposed estimators with the existing ones
(e.g., MLE).
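For intuition, under the standard assumption that a page changes according to a Poisson process with rate $\lambda$ and is crawled at a fixed interval $\tau$, the probability that a crawl observes a change is $1 - e^{-\lambda\tau}$, so inverting the empirical change frequency gives a simple baseline estimator (essentially the MLE in this fixed-interval setting). The sketch below is an illustrative baseline under these assumptions, not the paper's proposed online schemes; all names and parameters are our own.

```python
import math
import random

def estimate_change_rate(indicators, tau):
    """Estimate a Poisson page-change rate lambda from binary
    'changed since last crawl' indicators, assuming a fixed crawl
    interval tau. The observation probability is
    p = 1 - exp(-lambda * tau), so we invert the empirical frequency."""
    p_hat = sum(indicators) / len(indicators)
    p_hat = min(p_hat, 1.0 - 1e-9)  # guard: an all-ones sample would give infinity
    return -math.log(1.0 - p_hat) / tau

# Simulate crawling one page whose true change rate is lam.
random.seed(0)
lam, tau, n = 2.0, 0.5, 20000
p_change = 1.0 - math.exp(-lam * tau)
indicators = [1 if random.random() < p_change else 0 for _ in range(n)]
est = estimate_change_rate(indicators, tau)
```

Note that this batch inversion needs the full crawl history and degrades when nearly every crawl observes a change; the appeal of online schemes such as those in the paper is that they update the estimate incrementally after each crawl.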
Related papers
- Efficient and Optimal No-Regret Caching under Partial Observation [11.537072761243344]
We study the caching problem in a more restrictive setting where only a fraction of past requests are observed.
We propose a randomized caching policy with sublinear regret based on the classic online learning algorithm Follow-the-Perturbed-Leader.
arXiv Detail & Related papers (2025-03-04T16:21:33Z)
- Adaptive Semantic Prompt Caching with VectorQ [78.59891542553179]
Vector similarity metrics assign a numerical score to quantify the similarity between an embedded prompt and its nearest neighbor in the cache.
Existing systems rely on a static threshold to classify whether the similarity score is sufficiently high to result in a cache hit.
We show that this one-size-fits-all threshold is insufficient across different embeddings.
We propose VectorQ, an online framework with a threshold convergence guarantee to learn embedding-specific threshold regions.
arXiv Detail & Related papers (2025-02-06T04:16:20Z)
- A Scalable Crawling Algorithm Utilizing Noisy Change-Indicating Signals [35.53487005950327]
We propose a scalable crawling algorithm which (i) uses the noisy side information in an optimal way under mild assumptions; (ii) can be deployed without heavy centralized computation; (iii) is able to crawl web pages at a constant total rate without spikes in the total bandwidth usage over any time interval.
arXiv Detail & Related papers (2025-02-04T15:55:10Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer, through a caching mechanism, can be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- Towards General and Efficient Online Tuning for Spark [55.30868031221838]
We present a general and efficient Spark tuning framework that can deal with the three issues simultaneously.
We have implemented this framework as an independent cloud service, and applied it to the data platform in Tencent.
arXiv Detail & Related papers (2023-09-05T02:16:45Z)
- Improved Online Conformal Prediction via Strongly Adaptive Online Learning [86.4346936885507]
We develop new online conformal prediction methods that minimize the strongly adaptive regret.
We prove that our methods achieve near-optimal strongly adaptive regret for all interval lengths simultaneously.
Experiments show that our methods consistently obtain better coverage and smaller prediction sets than existing methods on real-world tasks.
arXiv Detail & Related papers (2023-02-15T18:59:30Z)
- MUSTACHE: Multi-Step-Ahead Predictions for Cache Eviction [0.709016563801433]
MUSTACHE is a new page cache replacement policy whose logic is learned from observed memory access requests rather than fixed in advance like existing policies.
We formulate the page request prediction problem as a categorical time series forecasting task.
Our method queries the learned page request forecaster to obtain the next $k$ predicted page memory references to better approximate the optimal Bélády's replacement algorithm.
arXiv Detail & Related papers (2022-11-03T23:10:21Z)
- Online Caching with no Regret: Optimistic Learning via Recommendations [15.877673959068458]
We build upon the Follow-the-Regularized-Leader (FTRL) framework to include predictions for the file requests.
We extend the framework to learn and utilize the best request predictor in cases where many are available.
We prove that the proposed optimistic learning caching policies can achieve sub-zero performance loss (regret) for perfect predictions.
arXiv Detail & Related papers (2022-04-20T09:29:47Z)
- Continual Test-Time Domain Adaptation [94.51284735268597]
Test-time domain adaptation aims to adapt a source pre-trained model to a target domain without using any source data.
CoTTA (Continual Test-Time Adaptation) is easy to implement and can be readily incorporated in off-the-shelf pre-trained models.
arXiv Detail & Related papers (2022-03-25T11:42:02Z)
- Online Caching with Optimistic Learning [15.877673959068458]
This paper proposes a new algorithmic toolbox for tackling this problem through the lens of optimistic online learning.
We design online caching algorithms for bipartite networks with fixed-size caches or elastic leased caches subject to time-average budget constraints.
We prove that the proposed optimistic learning caching policies can achieve sub-zero performance loss (regret) for perfect predictions, and maintain the best achievable regret bound $O(\sqrt{T})$ even for arbitrarily bad predictions.
arXiv Detail & Related papers (2022-02-22T00:04:30Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate DL inference workload and increase the system throughput, they introduce an approximation error.
We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
- Learning from Images: Proactive Caching with Parallel Convolutional Neural Networks [94.85780721466816]
A novel framework for proactive caching is proposed in this paper.
It combines model-based optimization with data-driven techniques by transforming an optimization problem into a grayscale image.
Numerical results show that the proposed scheme can reduce 71.6% computation time with only 0.8% additional performance cost.
arXiv Detail & Related papers (2021-08-15T21:32:47Z)
- Online Algorithms for Estimating Change Rates of Web Pages [2.4923006485141284]
Finite bandwidth availability and server restrictions limit how frequently different pages can be crawled.
Existing approaches either assume knowledge of the exact page change rates or use inefficient methods such as MLE to estimate them.
We provide three novel schemes for online estimation of page change rates, all of which have extremely low running times per iteration.
arXiv Detail & Related papers (2020-09-17T08:25:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.