Predicting Links on Wikipedia with Anchor Text Information
- URL: http://arxiv.org/abs/2105.11734v1
- Date: Tue, 25 May 2021 07:57:57 GMT
- Title: Predicting Links on Wikipedia with Anchor Text Information
- Authors: Robin Brochier, Frédéric Béchet
- Abstract summary: We study the transductive and the inductive tasks of link prediction on several subsets of the English Wikipedia.
We propose an appropriate evaluation sampling methodology and compare several algorithms.
- Score: 0.571097144710995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Wikipedia, the largest open-collaborative online encyclopedia, is a corpus of
documents bound together by internal hyperlinks. These links form the building
blocks of a large network whose structure contains important information on the
concepts covered in this encyclopedia. The presence of a link between two
articles, materialised by an anchor text in the source page pointing to the
target page, can increase readers' understanding of a topic. However, the
process of linking follows specific editorial rules to avoid both under-linking
and over-linking. In this paper, we study the transductive and the inductive
tasks of link prediction on several subsets of the English Wikipedia and
identify some key challenges behind automatic linking based on anchor text
information. We propose an appropriate evaluation sampling methodology and
compare several algorithms. Moreover, we propose baseline models that provide a
good estimation of the overall difficulty of the tasks.
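The abstract distinguishes transductive from inductive link prediction without spelling out the difference. The sketch below is a generic illustration on a toy graph, not the authors' sampling methodology: in the transductive setting a random share of links is hidden while every article stays visible during training, whereas in the inductive setting whole articles are held out, so every link touching them is unseen. The function names, the toy edge list, and the 20% ratio are all assumptions made for this example.

```python
import random

def transductive_split(edges, test_ratio=0.2, seed=0):
    """Hide a random fraction of edges; any node may still appear in training."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def inductive_split(edges, nodes, test_ratio=0.2, seed=0):
    """Hold out a fraction of nodes; edges touching a held-out node are test edges."""
    rng = random.Random(seed)
    shuffled = sorted(nodes)  # sort first so the shuffle is reproducible
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    held_out = set(shuffled[:n_test])
    train = [(s, t) for s, t in edges if s not in held_out and t not in held_out]
    test = [(s, t) for s, t in edges if s in held_out or t in held_out]
    return train, test, held_out

# Hypothetical toy graph: directed (source article, target article) pairs.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"),
         ("E", "A"), ("B", "E"), ("D", "A"), ("C", "E"), ("E", "B")]
nodes = {n for edge in edges for n in edge}

tr_train, tr_test = transductive_split(edges)
ind_train, ind_test, held_out = inductive_split(edges, nodes)

print("transductive:", len(tr_train), "train /", len(tr_test), "test edges")
print("inductive: held-out articles", sorted(held_out),
      "->", len(ind_train), "train /", len(ind_test), "test edges")
```

In the inductive case a model cannot rely on an embedding learned for the held-out article and has to fall back on its content, such as its text, which is one reason the two tasks are usually evaluated separately.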
Related papers
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Reranking Passages with Coarse-to-Fine Neural Retriever Enhanced by List-Context Information [0.9463895540925061]
This paper presents a list-context attention mechanism to augment the passage representation by incorporating the list-context information from other candidates.
The proposed coarse-to-fine (C2F) neural retriever addresses the out-of-memory limitation of the passage attention mechanism.
It integrates the coarse and fine rankers into the joint optimization process, allowing for feedback between the two layers to update the model simultaneously.
arXiv Detail & Related papers (2023-08-23T09:29:29Z)
- Anchor Prediction: Automatic Refinement of Internet Links [25.26235117917374]
We introduce the task of anchor prediction.
The goal is to identify the specific part of the linked target webpage that is most related to the source linking context.
We release the AuthorAnchors dataset, a collection of 34K naturally-occurring anchored links.
arXiv Detail & Related papers (2023-05-23T17:58:21Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Pre-training for Information Retrieval: Are Hyperlinks Fully Explored? [19.862211305690916]
We propose a progressive hyperlink predication (PHP) framework to explore the utilization of hyperlinks in pre-training.
Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.
arXiv Detail & Related papers (2022-09-14T12:03:31Z)
- Anchor Prediction: A Topic Modeling Approach [2.0411082897313984]
We propose an annotation, which we refer to as anchor prediction.
Given a source document and a target document, this task consists in automatically identifying anchors in the source document.
We propose a contextualized relational topic model, CRTM, that models directed links between documents.
arXiv Detail & Related papers (2022-05-29T11:26:52Z)
- Surfer100: Generating Surveys From Web Resources on Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation.
We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
arXiv Detail & Related papers (2021-12-13T02:18:01Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- A Multilingual Entity Linking System for Wikipedia with a Machine-in-the-Loop Approach [2.2889152373118975]
Despite Wikipedia editors' efforts to add and maintain its content, the distribution of links remains sparse in many language editions.
This paper introduces a machine-in-the-loop entity linking system that can comply with community guidelines for adding a link.
We develop an interactive recommendation interface that proposes candidate links to editors who can confirm, reject, or adapt the recommendation.
arXiv Detail & Related papers (2021-05-31T16:29:42Z)
- Context-Aware Interaction Network for Question Matching [51.76812857301819]
We propose a context-aware interaction network (COIN) to align two sequences and infer their semantic relationship.
Specifically, each interaction block includes (1) a context-aware cross-attention mechanism to effectively integrate contextual information, and (2) a gate fusion layer to flexibly interpolate aligned representations.
arXiv Detail & Related papers (2021-04-17T05:03:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.