Prediction of new outlinks for focused Web crawling
- URL: http://arxiv.org/abs/2111.05062v2
- Date: Wed, 10 Nov 2021 20:33:34 GMT
- Title: Prediction of new outlinks for focused Web crawling
- Authors: Thi Kim Nhung Dang (1), Doina Bucur (1), Berk Atil (2), Guillaume
Pitel (3), Frank Ruis (1), Hamidreza Kadkhodaei (1), and Nelly Litvak (1 and
4) ((1) University of Twente, The Netherlands, (2) Bogazici University,
Turkey, (3) Exensa, France, (4) Eindhoven University of Technology, The
Netherlands)
- Abstract summary: This work provides a methodology for detecting new links effectively using a short history.
We provide statistical models for three targets: the link change rate, the presence of new links, and the number of new links.
A notable finding is that, if the history of the target page is not available, then our new features, which represent the history of related pages, are most predictive for new links in the target page.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Discovering new hyperlinks enables Web crawlers to find new pages that have
not yet been indexed. This is especially important for focused crawlers because
they strive to provide a comprehensive analysis of specific parts of the Web,
thus prioritizing discovery of new pages over discovery of changes in content.
In the literature, changes in hyperlinks and content have usually been
considered simultaneously. However, there is also evidence suggesting that
these two types of changes are not necessarily related. Moreover, many studies
about predicting changes assume that a long history of a page is available, which
is unattainable in practice. The aim of this work is to provide a methodology
for detecting new links effectively using a short history. To this end, we use
a dataset of ten crawls at intervals of one week. Our study consists of three
parts. First, we obtain insight into the data by analyzing empirical properties
of the number of new outlinks. We observe that these properties are, on
average, stable over time, but there is a large difference between the emergence of
hyperlinks towards pages within and outside the domain of a target page
(internal and external outlinks, respectively). Next, we provide statistical
models for three targets: the link change rate, the presence of new links, and
the number of new links. These models include the features used earlier in the
literature, as well as new features introduced in this work. We analyze
correlation between the features, and investigate their informativeness. A
notable finding is that, if the history of the target page is not available,
then our new features, which represent the history of related pages, are most
predictive for new links in the target page. Finally, we propose ranking
methods as guidelines for focused crawlers to efficiently discover new pages,
which achieve excellent performance with respect to the corresponding targets.
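
As a rough illustration of the targets named in the abstract, the sketch below computes per-page counts of new outlinks from a short sequence of crawl snapshots, derives a simple empirical change rate, and ranks pages by it. It is not the authors' model or ranking method. The function names, the snapshot representation (one set of outlink URLs per crawl), and the definition of a "new" link as one absent from the immediately preceding crawl are all assumptions made for this example.

```python
from typing import Dict, List, Set


def new_outlink_counts(snapshots: List[Set[str]]) -> List[int]:
    """Count, for each consecutive pair of crawls, the outlinks present in
    the later crawl but absent from the earlier one (assumed definition)."""
    return [len(curr - prev) for prev, curr in zip(snapshots, snapshots[1:])]


def link_change_rate(snapshots: List[Set[str]]) -> float:
    """Fraction of crawl intervals in which at least one new outlink appeared.
    Returns 0.0 when fewer than two snapshots are available."""
    counts = new_outlink_counts(snapshots)
    if not counts:
        return 0.0
    return sum(1 for c in counts if c > 0) / len(counts)


def rank_pages(history: Dict[str, List[Set[str]]]) -> List[str]:
    """Order pages by decreasing empirical change rate; ties broken by URL."""
    return sorted(history, key=lambda page: (-link_change_rate(history[page]), page))


# Hypothetical three-crawl history for three pages.
history = {
    "a": [{"x"}, {"x", "y"}, {"x", "y", "z"}],
    "b": [{"x"}, {"x"}, {"x", "w"}],
    "c": [{"x"}, {"x"}, {"x"}],
}
ranking = rank_pages(history)
```

With weekly crawls, as in the dataset described above, each interval in this sketch would correspond to one week. A crawler could revisit pages in the order given by `ranking`, although the paper's own ranking methods rely on fitted statistical models rather than this raw frequency.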
Related papers
- Query-oriented Data Augmentation for Session Search [71.84678750612754]
We propose query-oriented data augmentation to enrich search logs and empower the modeling.
We generate supplemental training pairs by altering the most important part of a search context.
We develop several strategies to alter the current query, resulting in new training data with varying degrees of difficulty.
(2024-07-04T08:08:33Z)
- Directed Criteria Citation Recommendation and Ranking Through Link Prediction [0.32885740436059047]
Our model uses transformer-based graph embeddings to encode the meaning of each document, presented as a node within a citation network.
We show that the semantic representations that our model generates can outperform other content-based methods in recommendation and ranking tasks.
(2024-03-18T20:47:38Z)
- Revisiting Link Prediction: A Data Perspective [61.52668130971441]
Link prediction, a fundamental task on graphs, has proven indispensable in various applications, e.g., friend recommendation, protein analysis, and drug interaction prediction.
Evidence in existing literature underscores the absence of a universally best algorithm suitable for all datasets.
We recognize three fundamental factors critical to link prediction: local structural proximity, global structural proximity, and feature proximity.
(2023-10-01T21:09:59Z)
- Anchor Prediction: Automatic Refinement of Internet Links [25.26235117917374]
We introduce the task of anchor prediction.
The goal is to identify the specific part of the linked target webpage that is most related to the source linking context.
We release the AuthorAnchors dataset, a collection of 34K naturally-occurring anchored links.
(2023-05-23T17:58:21Z)
- Pre-training for Information Retrieval: Are Hyperlinks Fully Explored? [19.862211305690916]
We propose a progressive hyperlink predication (PHP) framework to explore the utilization of hyperlinks in pre-training.
Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.
(2022-09-14T12:03:31Z)
- Twitter Referral Behaviours on News Consumption with Ensemble Clustering of Click-Stream Data in Turkish Media [2.9005223064604078]
This study investigates the readers' click activities in the organizations' websites to identify news consumption patterns following referrals from Twitter.
The investigation is widened to a broad perspective by linking the log data with news content to enrich the insights.
(2022-02-04T09:57:13Z)
- Predicting Links on Wikipedia with Anchor Text Information [0.571097144710995]
We study the transductive and the inductive tasks of link prediction on several subsets of the English Wikipedia.
We propose an appropriate evaluation sampling methodology and compare several algorithms.
(2021-05-25T07:57:57Z)
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
(2020-11-16T10:02:52Z)
- What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
(2020-11-06T02:23:01Z)
- Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation [132.82884193921535]
We argue that previous methods underestimate the importance of video feature learning and propose a two-stage approach.
We show that this simple baseline approach outperforms prior few-shot video classification methods by over 20 points on existing benchmarks.
We present two novel approaches that yield further improvement.
(2020-07-09T13:05:32Z)
- ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages [66.45377533562417]
We propose a solution for "zero-shot" open-domain relation extraction from webpages with a previously unseen template.
Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage.
(2020-05-14T16:15:58Z)