Online Algorithms for Estimating Change Rates of Web Pages
- URL: http://arxiv.org/abs/2009.08142v2
- Date: Thu, 4 Nov 2021 19:59:28 GMT
- Title: Online Algorithms for Estimating Change Rates of Web Pages
- Authors: Konstantin Avrachenkov, Kishor Patil, Gugan Thoppe
- Abstract summary: Finite bandwidth availability and server restrictions limit how frequently different pages can be crawled. Existing tractable algorithms either assume knowledge of the exact page change rates or use inefficient methods such as MLE to estimate them. We provide three novel schemes for online estimation of page change rates, all of which have extremely low running times per iteration.
- Score: 2.4923006485141284
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: A search engine maintains local copies of different web pages to provide
quick search results. This local cache is kept up-to-date by a web crawler that
frequently visits these different pages to track changes in them. Ideally, the
local copy should be updated as soon as a page changes on the web. However,
finite bandwidth availability and server restrictions limit how frequently
different pages can be crawled. This brings forth the following optimization
problem: maximize the freshness of the local cache subject to the crawling
frequencies being within prescribed bounds. While tractable algorithms do exist
to solve this problem, these either assume the knowledge of exact page change
rates or use inefficient methods such as MLE for estimating the same. We
address this issue here.
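The optimization problem above can be written schematically. The fractional-freshness objective below assumes Poisson page changes and exponentially distributed crawl intervals; the notation (crawl rates $f_i$, change rates $\lambda_i$, importance weights $w_i$, bandwidth budget $B$) is ours, not necessarily the paper's exact statement:

```latex
% Schematic freshness-maximization problem for N pages (notation ours):
\begin{aligned}
\max_{f_1,\dots,f_N}\quad & \sum_{i=1}^{N} w_i\,\frac{f_i}{f_i + \lambda_i}
  && \text{(long-run fraction of time page $i$'s local copy is fresh)}\\
\text{s.t.}\quad & \sum_{i=1}^{N} f_i \le B, \qquad f_i \ge 0.
\end{aligned}
```

Solving this requires the $\lambda_i$, which is precisely what the estimation schemes below supply.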
We provide three novel schemes for online estimation of page change rates,
all of which have extremely low running times per iteration. The first is based
on the law of large numbers and the second on stochastic approximation. The
third is an extension of the second and includes a heavy-ball momentum term.
All these schemes only need partial information about the page change process,
i.e., they only need to know if the page has changed or not since the last
crawled instance. Our main theoretical results concern asymptotic convergence
and convergence rates of these three schemes. In fact, our work is the first to
show convergence of the original stochastic heavy-ball method when neither the
gradient nor the noise variance is uniformly bounded. We also provide some
numerical experiments (based on real and synthetic data) to demonstrate the
superiority of our proposed estimators over existing ones such as MLE. We
emphasize that our algorithms are also readily applicable to the
synchronization of databases and network inventory management.
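As a concrete illustration of the estimation setting, the sketch below simulates the partial observations described above (a Bernoulli indicator per crawl, assuming Poisson page changes and a constant crawl interval) and recovers the change rate two ways: a law-of-large-numbers inversion of the empirical change fraction, and a generic Robbins-Monro recursion with an optional heavy-ball momentum term. This is a minimal sketch under these assumptions, not the paper's exact update rules.

```python
import math
import random

def simulate_crawls(lam, tau, n, seed=0):
    # Each crawl observes only whether the page changed since the last visit.
    # For Poisson(lam) changes and interval tau, P(change) = 1 - exp(-lam*tau).
    rng = random.Random(seed)
    p = 1.0 - math.exp(-lam * tau)
    return [1 if rng.random() < p else 0 for _ in range(n)]

def lln_estimate(indicators, tau):
    # Law-of-large-numbers style estimate: invert the empirical change fraction.
    p_hat = sum(indicators) / len(indicators)
    p_hat = min(p_hat, 1.0 - 1e-9)  # guard against log(0) if every crawl saw a change
    return -math.log(1.0 - p_hat) / tau

def sa_estimate(indicators, tau, lam0=1.0, momentum=0.0):
    # Robbins-Monro recursion for the root of E[I] = 1 - exp(-lam * tau),
    # with an optional heavy-ball term (illustrative, not the paper's recursion).
    lam_prev = lam = lam0
    for k, ind in enumerate(indicators, start=1):
        step = k ** -0.6
        err = ind - (1.0 - math.exp(-lam * tau))
        nxt = lam + step * err + momentum * (lam - lam_prev)
        lam_prev, lam = lam, max(nxt, 1e-6)  # keep the rate positive
    return lam

obs = simulate_crawls(lam=2.0, tau=0.3, n=20000)
print(round(lln_estimate(obs, tau=0.3), 2))
print(round(sa_estimate(obs, tau=0.3, momentum=0.5), 2))
```

Both estimators use only the change indicators, never the number of changes between crawls, which is the partial-information constraint the abstract highlights.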
Related papers
- Efficient and Optimal No-Regret Caching under Partial Observation [11.537072761243344]
We study the caching problem in a more restrictive setting where only a fraction of past requests are observed.
We propose a randomized caching policy with sublinear regret based on the classic online learning algorithm Follow-the-Perturbed-Leader.
arXiv Detail & Related papers (2025-03-04T16:21:33Z)
- Adaptive Semantic Prompt Caching with VectorQ [78.59891542553179]
Vector similarity metrics assign a numerical score to quantify the similarity between an embedded prompt and its nearest neighbor in the cache.
Existing systems rely on a static threshold to classify whether the similarity score is sufficiently high to result in a cache hit.
We show that this one-size-fits-all threshold is insufficient across different embeddings.
We propose VectorQ, an online framework with a threshold convergence guarantee to learn embedding-specific threshold regions.
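The static-threshold baseline that VectorQ improves on can be sketched as follows; the function names, the toy 2-d embeddings, and the 0.9 cutoff are illustrative assumptions, not VectorQ's API:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cache_lookup(query_vec, cache, threshold=0.9):
    # Static-threshold semantic cache: return the cached answer of the
    # nearest embedded prompt if its similarity clears the threshold,
    # otherwise signal a miss by returning None.
    if not cache:
        return None
    best_vec, best_ans = max(cache, key=lambda item: cosine(query_vec, item[0]))
    return best_ans if cosine(query_vec, best_vec) >= threshold else None

cache = [((1.0, 0.0), "answer-a"), ((0.0, 1.0), "answer-b")]
print(cache_lookup((0.98, 0.05), cache))  # close to the first cached prompt
print(cache_lookup((0.7, 0.7), cache))    # ambiguous query: below the cutoff
```

VectorQ's point is that a single global `threshold` like this is miscalibrated across embeddings, which motivates learning embedding-specific threshold regions instead.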
arXiv Detail & Related papers (2025-02-06T04:16:20Z)
- A Scalable Crawling Algorithm Utilizing Noisy Change-Indicating Signals [35.53487005950327]
We propose a scalable crawling algorithm which (i) uses the noisy side information in an optimal way under mild assumptions; (ii) can be deployed without heavy centralized computation; (iii) is able to crawl web pages at a constant total rate without spikes in the total bandwidth usage over any time interval.
arXiv Detail & Related papers (2025-02-04T15:55:10Z)
- Online Weighted Paging with Unknown Weights [15.525560291268219]
We present the first algorithm for online weighted paging that does not know page weights in advance, but rather learns from weight samples.
We believe that our work can inspire online algorithms for other problems that involve cost sampling.
arXiv Detail & Related papers (2024-10-28T17:57:40Z)
- Towards Synchronous Memorizability and Generalizability with Site-Modulated Diffusion Replay for Cross-Site Continual Segmentation [50.70671908078593]
This paper proposes a novel training paradigm, learning towards Synchronous Memorizability and Generalizability (SMG-Learning).
We create the orientational gradient alignment to ensure memorizability on previous sites, and arbitrary gradient alignment to enhance generalizability on unseen sites.
Experimental results show that our method enhances both memorizability and generalizability more effectively than other state-of-the-art methods.
arXiv Detail & Related papers (2024-06-26T03:10:57Z)
- Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making [66.27188304203217]
Temporal distances lie at the heart of many algorithms for planning, control, and reinforcement learning.
Prior attempts to define such temporal distances have been stymied by an important limitation.
We show how successor features learned by contrastive learning form a temporal distance that does satisfy the triangle inequality.
arXiv Detail & Related papers (2024-06-24T19:36:45Z)
- Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching [56.286064975443026]
We make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer can, through a caching mechanism, be readily removed even without updating the model parameters.
We introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers.
Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, as well as prior cache-based methods, at the same inference speed.
arXiv Detail & Related papers (2024-06-03T18:49:57Z)
- Multi-Agent Reinforcement Learning with Hierarchical Coordination for Emergency Responder Stationing [8.293120269016834]
An emergency responder management (ERM) system dispatches responders when it receives requests for medical aid.
ERM systems can proactively reposition responders between predesignated waiting locations to cover any gaps.
The state-of-the-art approach in proactive repositioning is a hierarchical approach based on spatial decomposition and online Monte Carlo tree search.
We introduce a novel reinforcement learning (RL) approach, based on the same hierarchical decomposition, but replacing online search with learning.
arXiv Detail & Related papers (2024-05-21T21:15:45Z)
- BayOTIDE: Bayesian Online Multivariate Time series Imputation with functional decomposition [31.096125530322933]
In real-world scenarios such as traffic and energy, massive time-series data with missing values and noise are widely observed and are often sampled irregularly.
While many imputation methods have been proposed, most of them work with a local horizon.
Almost all methods assume the observations are sampled at regular time stamps and fail to handle complex, irregularly sampled time series.
arXiv Detail & Related papers (2023-08-28T21:17:12Z)
- Dynamic Graph Learning Based on Hierarchical Memory for Origin-Destination Demand Prediction [12.72319550363076]
This paper provides a dynamic graph representation learning framework for OD demand prediction.
In particular, a hierarchical memory updater is first proposed to maintain a time-aware representation for each node.
A temporal propagation mechanism is provided to aggregate representations of neighbor nodes along a random temporal route.
An objective function is designed to derive the future OD demands according to the most recent node.
arXiv Detail & Related papers (2022-05-29T07:52:35Z)
- Continual Test-Time Domain Adaptation [94.51284735268597]
Test-time domain adaptation aims to adapt a source pre-trained model to a target domain without using any source data.
CoTTA is easy to implement and can be readily incorporated in off-the-shelf pre-trained models.
arXiv Detail & Related papers (2022-03-25T11:42:02Z)
- AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
- Learning to Accelerate Heuristic Searching for Large-Scale Maximum Weighted b-Matching Problems in Online Advertising [51.97494906131859]
Bipartite b-matching is fundamental in algorithm design, and has been widely applied into economic markets, labor markets, etc.
Existing exact and approximate algorithms usually fail in such settings because they require prohibitive running time or excessive computational resources.
We propose NeuSearcher, which leverages the knowledge learned from previously solved instances to solve new problem instances.
arXiv Detail & Related papers (2020-05-09T02:48:23Z)
- Change Rate Estimation and Optimal Freshness in Web Page Crawling [2.4923006485141284]
Finite bandwidth availability and server restrictions impose some constraints on the crawling frequency.
The ideal crawling rates are the ones that maximise the freshness of the local cache.
We provide two novel schemes for online estimation of page change rates.
arXiv Detail & Related papers (2020-04-05T11:48:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.