A Multilingual Entity Linking System for Wikipedia with a
Machine-in-the-Loop Approach
- URL: http://arxiv.org/abs/2105.15110v1
- Date: Mon, 31 May 2021 16:29:42 GMT
- Title: A Multilingual Entity Linking System for Wikipedia with a
Machine-in-the-Loop Approach
- Authors: Martin Gerlach and Marshall Miller and Rita Ho and Kosta Harlan and
Djellel Difallah
- Abstract summary: Despite Wikipedia editors' efforts to add and maintain its content, the distribution of links remains sparse in many language editions.
This paper introduces a machine-in-the-loop entity linking system that can comply with community guidelines for adding a link.
We develop an interactive recommendation interface that proposes candidate links to editors who can confirm, reject, or adapt the recommendation.
- Score: 2.2889152373118975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hyperlinks constitute the backbone of the Web; they enable user navigation,
information discovery, content ranking, and many other crucial services on the
Internet. In particular, hyperlinks found within Wikipedia allow the readers to
navigate from one page to another to expand their knowledge on a given subject
of interest or to discover a new one. However, despite Wikipedia editors'
efforts to add and maintain its content, the distribution of links remains
sparse in many language editions. This paper introduces a machine-in-the-loop
entity linking system that can comply with community guidelines for adding a
link and aims to increase link coverage in new pages and low-resource
wiki-projects. To tackle these challenges, we build a context- and
language-agnostic entity linking model that combines data collected from millions of
anchors found across wiki-projects, as well as billions of users' reading
sessions. We develop an interactive recommendation interface that proposes
candidate links to editors who can confirm, reject, or adapt the recommendation
with the overall aim of providing a more accessible editing experience for
newcomers through structured tasks. Our system's design choices were made in
collaboration with members of several language communities. When the system is
implemented as part of Wikipedia, its usage by volunteer editors will help us
build a continuous evaluation dataset with active feedback. Our experimental
results show that our link recommender can achieve a precision above 80% while
ensuring a recall of at least 50% across six languages spanning different
sizes, continents, and language families.
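For illustration, the sketch below shows one way such a recommender could generate candidate links from an anchor dictionary mined from existing wikilinks, proposing a link only when its link probability clears a precision-oriented threshold. This is a minimal sketch under stated assumptions, not the authors' implementation: all names (AnchorDictionary, LinkRecommendation, recommend_links) and the simple thresholding rule are illustrative, and the reading-session features used by the paper's model are omitted.
```python
# Minimal sketch (assumptions, not the paper's code): anchor-dictionary-based
# link candidate generation with a confidence threshold tuned for precision.
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class LinkRecommendation:
    anchor: str        # surface text to be linked
    target: str        # proposed target article title
    confidence: float  # here: plain link probability of (anchor -> target)


class AnchorDictionary:
    """Maps anchor texts to target-article counts mined from existing wikilinks."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def add_link(self, anchor: str, target: str) -> None:
        # Record one observed wikilink with this anchor text and target.
        self.counts[anchor.lower()][target] += 1

    def candidates(self, anchor: str):
        # Return (target, link probability) pairs, most frequent target first.
        targets = self.counts.get(anchor.lower())
        if not targets:
            return []
        total = sum(targets.values())
        return [(t, c / total) for t, c in targets.most_common()]


def recommend_links(text: str, anchors: AnchorDictionary,
                    threshold: float = 0.8) -> list[LinkRecommendation]:
    """Propose links for known anchors occurring in `text`.

    The threshold trades recall for precision; the paper reports keeping
    precision above 80% while retaining a recall of at least 50%.
    """
    lowered = text.lower()
    recs = []
    for anchor in anchors.counts:
        if anchor in lowered:
            target, prob = anchors.candidates(anchor)[0]  # best target only
            if prob >= threshold:
                recs.append(LinkRecommendation(anchor, target, prob))
    return recs
```
In the workflow described in the abstract, each recommendation would then be surfaced in the editing interface for editors to confirm, reject, or adapt, and those decisions would feed the continuous evaluation dataset.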
Related papers
- MegaWika: Millions of reports and their sources across 50 diverse
languages [74.3909725023673]
MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials.
We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications.
MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
arXiv Detail & Related papers (2023-07-13T20:04:02Z)
- Orphan Articles: The Dark Matter of Wikipedia [13.290424502717734]
We conduct the first systematic study of orphan articles, which are articles without any incoming links from other Wikipedia articles.
We find that a surprisingly large share of content, roughly 15% (8.8M) of all articles, is de facto invisible to readers navigating Wikipedia.
We also provide causal evidence through a quasi-experiment that adding new incoming links to orphans (de-orphanization) leads to a statistically significant increase of their visibility.
arXiv Detail & Related papers (2023-06-06T18:04:33Z)
- A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data.
We design a novel attention mechanism Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context.
arXiv Detail & Related papers (2023-05-05T16:38:05Z)
- XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages [11.581072296148031]
Existing work on Wikipedia text generation has focused on English only where English reference articles are summarized to generate English Wikipedia pages.
We propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text.
arXiv Detail & Related papers (2023-03-22T04:52:43Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- Improving Wikipedia Verifiability with AI [116.69749668874493]
We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims.
Our first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the same top 10% most likely unverifiable claims.
Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia.
arXiv Detail & Related papers (2022-07-08T15:23:29Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Predicting Links on Wikipedia with Anchor Text Information [0.571097144710995]
We study the transductive and the inductive tasks of link prediction on several subsets of the English Wikipedia.
We propose an appropriate evaluation sampling methodology and compare several algorithms.
arXiv Detail & Related papers (2021-05-25T07:57:57Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Architecture for a multilingual Wikipedia [0.0]
We argue that a new approach is needed to tackle the uneven coverage of content across Wikipedia's language editions more effectively.
This paper proposes an architecture for a system that fulfills this goal.
It separates the goal in two parts: creating and maintaining content in an abstract notation within a project called Abstract Wikipedia, and creating an infrastructure called Wikilambda that can translate this notation to natural language.
arXiv Detail & Related papers (2020-04-08T22:25:10Z)