Efficient Crawling for Scalable Web Data Acquisition (Extended Version)
- URL: http://arxiv.org/abs/2602.11874v1
- Date: Thu, 12 Feb 2026 12:23:53 GMT
- Title: Efficient Crawling for Scalable Web Data Acquisition (Extended Version)
- Authors: Antoine Gauquier, Ioana Manolescu, Pierre Senellart
- Abstract summary: SB-CLASSIFIER is a crawler that efficiently learns which hyperlinks lead to pages that link to many targets. We show that our crawler is highly efficient, delivering high fractions of a site's targets while crawling only a small part.
- Score: 4.64103400183613
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Journalistic fact-checking, as well as social or economic research, require analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve open statistics data accessibility, we present a focused Web crawling algorithm that retrieves as many targets, i.e., resources of certain types, as possible, from a given website, in an efficient and scalable way, by crawling (much) less than the full website. We show that optimally solving this problem is intractable, and propose an approach based on reinforcement learning, namely using sleeping bandits. We propose SB-CLASSIFIER, a crawler that efficiently learns which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Our experiments on websites with millions of webpages show that our crawler is highly efficient, delivering high fractions of a site's targets while crawling only a small part.
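The abstract describes a sleeping-bandit formulation in which the crawler learns, from the paths leading to links in their enclosing pages, which links are worth following next. Below is a minimal, hypothetical sketch of that idea: an epsilon-greedy sleeping bandit whose arms are link-path features and whose reward is the number of targets found on a fetched page. All names (`SleepingBanditCrawler`, `fetch_page`, `extract_links`, `count_targets`) are illustrative assumptions, not the authors' SB-CLASSIFIER implementation.
```python
import random
from collections import defaultdict

class SleepingBanditCrawler:
    """Epsilon-greedy sleeping bandit over link-path 'arms' (illustrative sketch only)."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.pulls = defaultdict(int)         # arm -> number of times selected
        self.reward_sum = defaultdict(float)  # arm -> cumulative reward (targets found)
        self.frontier = defaultdict(list)     # arm -> URLs waiting to be crawled
        self.seen = set()                     # URLs already queued or crawled

    def add_link(self, url, path_feature):
        """Queue a discovered link under the path feature of its anchor."""
        if url not in self.seen:
            self.seen.add(url)
            self.frontier[path_feature].append(url)

    def _mean_reward(self, arm):
        return self.reward_sum[arm] / self.pulls[arm] if self.pulls[arm] else 0.0

    def select(self):
        """Pick an awake arm (one with queued URLs) and pop a URL from it."""
        awake = [arm for arm, urls in self.frontier.items() if urls]
        if not awake:
            return None, None
        if random.random() < self.epsilon:
            arm = random.choice(awake)               # explore
        else:
            arm = max(awake, key=self._mean_reward)  # exploit the best awake arm
        return arm, self.frontier[arm].pop()

    def update(self, arm, reward):
        """Record how many targets the crawled page linked to."""
        self.pulls[arm] += 1
        self.reward_sum[arm] += reward


def crawl(crawler, budget, fetch_page, extract_links, count_targets):
    """Crawl up to `budget` pages. The three callables are placeholders for an
    HTTP client, an HTML link extractor (returning (url, path_feature) pairs),
    and a detector counting target resources on a page."""
    for _ in range(budget):
        arm, url = crawler.select()
        if url is None:
            break
        page = fetch_page(url)
        crawler.update(arm, count_targets(page))
        for link_url, path_feature in extract_links(page):
            crawler.add_link(link_url, path_feature)
```
In this sketch an arm is "asleep" whenever the frontier holds no link with its path feature, mirroring the sleeping-bandit setting in which the set of available actions changes at each step; the paper's actual method additionally uses a classifier over link paths rather than a plain per-feature average.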
Related papers
- Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory [69.49061918994882]
Branch-and-Browse is a fine-grained web agent framework that unifies structured reasoning-acting, contextual memory, and efficient execution. On the WebArena benchmark, Branch-and-Browse achieves a task success rate of 35.8% and reduces execution time by up to 40.4% relative to state-of-the-art methods.
arXiv Detail & Related papers (2025-10-18T00:45:37Z)
- Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval [0.0]
This project presents a framework for indexing and analyzing large language training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI's FineWeb-2 corpus, achieving fast query performance -- most searches in milliseconds, all under 2 seconds.
arXiv Detail & Related papers (2025-08-29T17:04:20Z)
- Craw4LLM: Efficient Web Crawling for LLM Pretraining [45.92222494772196]
Craw4LLM is an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Craw4LLM in obtaining high-quality pretraining data.
arXiv Detail & Related papers (2025-02-19T00:31:43Z)
- WebWalker: Benchmarking LLMs in Web Traversal [64.48425443951749]
We introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm.
arXiv Detail & Related papers (2025-01-13T18:58:07Z)
- Generative Pre-trained Ranking Model with Over-parameterization at Web-Scale (Extended Abstract) [73.57710917145212]
Learning to rank is widely employed in web searches to prioritize pertinent webpages based on input queries.
We propose a Generative Semi-Supervised Pre-trained (GS2P) model to address these challenges.
We conduct extensive offline experiments on both a publicly available dataset and a real-world dataset collected from a large-scale search engine.
arXiv Detail & Related papers (2024-09-25T03:39:14Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- GROWN+UP: A Graph Representation Of a Webpage Network Utilizing Pre-training [0.2538209532048866]
We introduce an agnostic deep graph neural network feature extractor that can ingest webpage structures, pre-train self-supervised on massive unlabeled data, and fine-tune to arbitrary tasks on webpages effectually.
We show that our pre-trained model achieves state-of-the-art results using multiple datasets on two very different benchmarks: webpage boilerplate removal and genre classification.
arXiv Detail & Related papers (2022-08-03T13:37:27Z)
- Tree-based Focused Web Crawling with Reinforcement Learning [3.4877567508788134]
A focused crawler aims at discovering as many web pages and web sites relevant to a target topic as possible, while avoiding irrelevant ones. We propose TRES, a novel framework for focused crawling that aims at maximizing both the number of relevant web pages and the number of relevant web sites.
arXiv Detail & Related papers (2021-12-12T00:19:47Z)
- Meta Navigator: Search for a Good Adaptation Policy for Few-shot Learning [113.05118113697111]
Few-shot learning aims to adapt knowledge learned from previous tasks to novel tasks with only a limited amount of labeled data.
Research literature on few-shot learning exhibits great diversity, while different algorithms often excel at different few-shot learning scenarios.
We present Meta Navigator, a framework that attempts to solve the limitation in few-shot learning by seeking a higher-level strategy.
arXiv Detail & Related papers (2021-09-13T07:20:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.