Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web--Knowledge--Web Pipeline
- URL: http://arxiv.org/abs/2602.24262v1
- Date: Fri, 27 Feb 2026 18:31:42 GMT
- Title: Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web--Knowledge--Web Pipeline
- Authors: Yijiashun Qi, Yijiazhen Qi, Tanmay Wagh
- Abstract summary: Existing business databases suffer from substantial coverage gaps. We propose a Web--Knowledge--Web (W$\to$K$\to$W) pipeline. It crawls domain-specific web sources to discover candidate supplier entities. It consolidates structured knowledge into a heterogeneous knowledge graph.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Identifying the full landscape of small and medium-sized enterprises (SMEs) in specialized industry sectors is critical for supply-chain resilience, yet existing business databases suffer from substantial coverage gaps -- particularly for sub-tier suppliers and firms in emerging niche markets. We propose a Web--Knowledge--Web (W$\to$K$\to$W) pipeline that iteratively (1) crawls domain-specific web sources to discover candidate supplier entities, (2) extracts and consolidates structured knowledge into a heterogeneous knowledge graph, and (3) uses the knowledge graph's topology and coverage signals to guide subsequent crawling toward under-represented regions of the supplier space. To quantify discovery completeness, we introduce a coverage estimation framework inspired by ecological species-richness estimators (Chao1, ACE) adapted for web-entity populations. Experiments on the semiconductor equipment manufacturing sector (NAICS 333242) demonstrate that the W$\to$K$\to$W pipeline achieves the highest precision (0.138) and F1 (0.118) among all methods using the same 213-page crawl budget, building a knowledge graph of 765 entities and 586 relations while reaching peak recall by iteration 3 with only 112 pages.
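The abstract names Chao1 as one inspiration for its coverage estimation framework but does not spell out the adaptation. As a rough illustration only, the sketch below applies the standard bias-corrected Chao1 richness estimator to a list of entity sightings. The function names `chao1` and `coverage`, and the idea of treating each mention of a supplier on a crawled page as one "sighting", are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def chao1(sightings):
    """Bias-corrected Chao1 estimate of the total number of distinct entities.

    `sightings` is an iterable of entity identifiers, one per observation
    (hypothetical input format: one item per supplier mention on a page).
    """
    counts = Counter(sightings)          # entity -> number of times observed
    freq = Counter(counts.values())      # k -> number of entities observed exactly k times
    s_obs = len(counts)                  # distinct entities observed so far
    f1 = freq.get(1, 0)                  # singletons (seen exactly once)
    f2 = freq.get(2, 0)                  # doubletons (seen exactly twice)
    # Bias-corrected form; defined even when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def coverage(sightings):
    """Observed distinct entities divided by the Chao1 estimate (0.0 if empty)."""
    sightings = list(sightings)
    estimate = chao1(sightings)
    return len(set(sightings)) / estimate if estimate else 0.0


# Toy data: 5 distinct entities, 3 singletons, 2 doubletons.
sample = ["a", "a", "b", "c", "c", "d", "e"]
print(chao1(sample))     # 6.0
print(coverage(sample))  # 5 / 6, roughly 0.833
```

Many singletons relative to doubletons push the estimate well above the observed count, which is the kind of signal a crawler could use to decide that a region of the supplier space is still under-sampled.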
Related papers
- Detecting High-Potential SMEs with Heterogeneous Graph Neural Networks [0.0]
Small and Medium Enterprises (SMEs) constitute 99.9% of U.S. businesses and generate 44% of economic activity. We introduce SME-HGT, a Heterogeneous Graph Transformer framework that predicts which Phase I awardees will advance to Phase II funding using exclusively public data.
arXiv Detail & Related papers (2026-02-23T08:35:55Z) - Towards Federated Clustering: A Client-wise Private Graph Aggregation Framework [57.04850867402913]
Federated clustering addresses the challenge of extracting patterns from decentralized, unlabeled data. We propose Structural Privacy-Preserving Federated Graph Clustering (SPP-FGC), a novel algorithm that innovatively leverages local structural graphs as the primary medium for privacy-preserving knowledge sharing. Our framework achieves state-of-the-art performance, improving clustering accuracy by up to 10% (NMI) over federated baselines while maintaining provable privacy guarantees.
arXiv Detail & Related papers (2025-11-14T03:05:22Z) - Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents [70.77400371166922]
Deep research web agents need to rigorously analyze and aggregate knowledge for insightful research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models.
arXiv Detail & Related papers (2025-10-16T08:37:42Z) - LABELING COPILOT: A Deep Research Agent for Automated Data Curation in Computer Vision [13.437102865245285]
We introduce Labeling Copilot, the first data curation deep research agent for computer vision. A central orchestrator agent, powered by a large multimodal language model, uses multi-step reasoning to execute specialized tools across three core capabilities.
arXiv Detail & Related papers (2025-09-26T17:55:26Z) - Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search [54.987957691350665]
Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. We propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search.
arXiv Detail & Related papers (2025-08-28T08:51:51Z) - Structural and Connectivity Patterns in the Maven Central Software Dependency Network [0.0]
We investigate the Maven Central ecosystem, one of the largest repositories of Java libraries. We extracted a sample consisting of the top 5,000 highly connected artifacts based on their degree centrality. We conducted a comprehensive analysis of this graph, computing degree distributions, betweenness centrality, PageRank centrality, and connected components graph-theoretic metrics.
arXiv Detail & Related papers (2025-08-19T13:24:46Z) - AGENTICT$^2$S: Robust Text-to-SPARQL via Agentic Collaborative Reasoning over Heterogeneous Knowledge Graphs for the Circular Economy [42.73610751710192]
AgenticT$^2$S is a framework that decomposes knowledge graphs into subtasks managed by specialized agents. A two-stage verifier detects structurally invalid and semantically underspecified queries. Experiments on real-world circular economy KGs demonstrate that AgenticT$^2$S improves execution accuracy by 17.3%.
arXiv Detail & Related papers (2025-08-03T15:58:54Z) - SNaRe: Domain-aware Data Generation for Low-Resource Event Detection [77.32937742071475]
Event Detection is critical for enabling reasoning in highly specialized domains such as biomedicine, law, and epidemiology. We introduce SNaRe, a domain-aware synthetic data generation framework composed of three components: Scout, Narrator, and Refiner. Scout extracts triggers from unlabeled target domain data and curates a high-quality domain-specific trigger list. Narrator, conditioned on these triggers, generates high-quality domain-aligned sentences, and Refiner identifies additional event mentions.
arXiv Detail & Related papers (2025-02-24T18:20:42Z) - Triplètoile: Extraction of Knowledge from Microblogging Text [7.848242781280095]
We propose an enhanced information extraction pipeline tailored to the extraction of a knowledge graph comprising open-domain entities from micro-blogging posts on social media platforms.
Our pipeline leverages dependency parsing and classifies entity relations in an unsupervised manner through hierarchical clustering over word embeddings.
We provide a use case on extracting semantic triples from a corpus of 100 thousand tweets about digital transformation and publicly release the generated knowledge graph.
arXiv Detail & Related papers (2024-08-27T09:35:13Z) - Building A Knowledge Graph to Enrich ChatGPT Responses in Manufacturing Service Discovery [0.5919433278490629]
This study proposes a method that integrates bottom-up ontology with advanced machine learning models to develop a Manufacturing Service Knowledge Graph.
The Knowledge Graph and the learned graph embedding vectors are leveraged to tackle intricate queries within the digital supply chain network.
The approach highlighted is scalable to millions of entities that can be distributed to form a global Manufacturing Service Knowledge Network Graph.
arXiv Detail & Related papers (2024-04-09T18:46:46Z) - How Much Data are Enough? Investigating Dataset Requirements for Patch-Based Brain MRI Segmentation Tasks [74.21484375019334]
Training deep neural networks reliably requires access to large-scale datasets.
To mitigate both the time and financial costs associated with model development, a clear understanding of the amount of data required to train a satisfactory model is crucial.
This paper proposes a strategic framework for estimating the amount of annotated data required to train patch-based segmentation networks.
arXiv Detail & Related papers (2024-04-04T13:55:06Z) - Webly Supervised Fine-Grained Recognition: Benchmark Datasets and An Approach [115.91099791629104]
We construct two new benchmark webly supervised fine-grained datasets, WebFG-496 and WebiNat-5089.
WebiNat-5089 contains 5089 sub-categories and more than 1.1 million web training images, making it the largest webly supervised fine-grained dataset to date.
As a minor contribution, we also propose a novel webly supervised method (termed "Peer-learning") for benchmarking these datasets.
arXiv Detail & Related papers (2021-08-05T06:28:32Z) - Inferring Latent Domains for Unsupervised Deep Domain Adaptation [54.963823285456925]
Unsupervised Domain Adaptation (UDA) refers to the problem of learning a model in a target domain where labeled data are not available.
This paper introduces a novel deep architecture which addresses the problem of UDA by automatically discovering latent domains in visual datasets.
We evaluate our approach on publicly available benchmarks, showing that it outperforms state-of-the-art domain adaptation methods.
arXiv Detail & Related papers (2021-03-25T14:33:33Z)