What if we had no Wikipedia? Domain-independent Term Extraction from a
Large News Corpus
- URL: http://arxiv.org/abs/2009.08240v1
- Date: Thu, 17 Sep 2020 12:45:46 GMT
- Title: What if we had no Wikipedia? Domain-independent Term Extraction from a
Large News Corpus
- Authors: Yonatan Bilu, Shai Gretz, Edo Cohen and Noam Slonim
- Abstract summary: We aim to identify "wiki-worthy" terms in a massive news corpus, and see if this can be done with no, or minimal, dependency on actual Wikipedia entries.
Our work sheds new light on the domain-specific Automatic Term Extraction problem, with the problem at hand being a domain-independent variant of it.
- Score: 9.081222401894552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the most impressive human endeavors of the past two decades is the
collection and categorization of human knowledge in the free and accessible
format that is Wikipedia. In this work we ask what makes a term worthy of
entering this edifice of knowledge, and having a page of its own in Wikipedia?
To what extent is this a natural product of on-going human discourse and
discussion rather than an idiosyncratic choice of Wikipedia editors?
Specifically, we aim to identify such "wiki-worthy" terms in a massive news
corpus, and see if this can be done with no, or minimal, dependency on actual
Wikipedia entries. We suggest a five-step pipeline for doing so, providing
baseline results for all five, and the relevant datasets for benchmarking them.
Our work sheds new light on the domain-specific Automatic Term Extraction
problem, with the problem at hand being a domain-independent variant of it.
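The abstract describes, but does not spell out, a five-step pipeline for surfacing wiki-worthy terms from a news corpus. As a hedged illustration only, the sketch below runs a toy news corpus through five placeholder stages (candidate extraction, counting, filtering, scoring, selection); every heuristic in it is an assumption for illustration, not the paper's method.

```python
# Illustrative five-stage "wiki-worthiness" sketch over a toy news corpus.
# Stage names and heuristics are placeholders; the paper's actual steps
# are not described in the abstract above.
import re
from collections import Counter

def extract_candidates(doc: str) -> list[str]:
    """Stage 1 (placeholder): treat runs of capitalized words as candidate terms."""
    return [m.strip() for m in re.findall(r"(?:[A-Z][a-z]+ ?){1,4}", doc)]

def count_candidates(corpus: list[str]) -> Counter:
    """Stage 2 (placeholder): aggregate candidate frequencies across the corpus."""
    counts = Counter()
    for doc in corpus:
        counts.update(extract_candidates(doc))
    return counts

def filter_candidates(counts: Counter, min_freq: int = 2) -> dict:
    """Stage 3 (placeholder): discard rare and very short candidates."""
    return {t: c for t, c in counts.items() if c >= min_freq and len(t) > 3}

def score_candidates(filtered: dict) -> list:
    """Stage 4 (placeholder): rank by frequency; a real system would use
    richer statistical or learned features."""
    return sorted(filtered.items(), key=lambda kv: kv[1], reverse=True)

def select_wiki_worthy(scored: list, k: int = 5) -> list:
    """Stage 5 (placeholder): keep the top-k terms as wiki-worthy candidates."""
    return [term for term, _ in scored[:k]]

if __name__ == "__main__":
    corpus = [
        "Analysts say the European Central Bank raised interest rates on Thursday.",
        "Reports suggest the European Central Bank expects inflation to ease.",
    ]
    ranked = score_candidates(filter_candidates(count_candidates(corpus)))
    print(select_wiki_worthy(ranked))  # -> ['European Central Bank']
```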
Related papers
- Orphan Articles: The Dark Matter of Wikipedia [13.290424502717734]
We conduct the first systematic study of orphan articles, which are articles without any incoming links from other Wikipedia articles.
We find that a surprisingly large extent of content, roughly 15% (8.8M) of all articles, is de facto invisible to readers navigating Wikipedia.
We also provide causal evidence through a quasi-experiment that adding new incoming links to orphans (de-orphanization) leads to a statistically significant increase in their visibility (a toy orphan-detection sketch follows this entry).
arXiv Detail & Related papers (2023-06-06T18:04:33Z)
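The entry above defines an orphan as an article with no incoming links from other articles. The toy sketch below, using a made-up link table, expresses that definition as a set computation; it is not the study's actual methodology.

```python
# Toy illustration of orphan detection: an article is an orphan if no other
# article links to it. The article titles and link table below are invented.
links = {                      # source article -> articles it links to
    "Alan Turing": {"Enigma machine", "Computability"},
    "Enigma machine": {"Alan Turing"},
    "Computability": {"Alan Turing"},
    "Some obscure topic": {"Alan Turing"},
}
all_articles = set(links)
linked_to = set().union(*links.values())
orphans = all_articles - linked_to
print(orphans)  # {'Some obscure topic'}: no incoming links, invisible to navigation
```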
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as groups of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia (a toy template-based verbalization follows this entry).
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
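To make the statement-to-sentence idea concrete, here is a deliberately tiny template-based verbalization of triples. The property labels and templates are invented for illustration; the paper's mapping over quadruples and triples is considerably richer.

```python
# Toy verbalization of Wikidata-style triples into English sentences.
# Property labels and templates are made up; real Wikidata uses property IDs.
triples = [
    ("Douglas Adams", "educated at", "St John's College"),
    ("Douglas Adams", "occupation", "writer"),
]
templates = {
    "educated at": "{s} was educated at {o}.",
    "occupation": "{s} worked as a {o}.",
}
for s, p, o in triples:
    # Fall back to a generic pattern when no template is defined for a property.
    print(templates.get(p, "{s} {p} {o}.").format(s=s, p=p, o=o))
```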
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Improving Wikipedia Verifiability with AI [116.69749668874493]
We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims.
The system's first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the top 10% of claims judged most likely to be unverifiable.
Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia (a toy support-scoring sketch follows this entry).
arXiv Detail & Related papers (2022-07-08T15:23:29Z)
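The sketch below caricatures the task framing: score how well a source passage supports a claim and surface the weakest pairs first. It uses naive lexical overlap purely for illustration; the Side system relies on a neural verification engine, not this heuristic.

```python
# Toy stand-in for citation verification: rank claim/source pairs by a naive
# lexical-overlap score so that poorly supported claims surface first.
def support_score(claim: str, source: str) -> float:
    """Fraction of claim tokens that also appear in the cited source."""
    c, s = set(claim.lower().split()), set(source.lower().split())
    return len(c & s) / max(len(c), 1)

pairs = [
    ("Turing was born in London in 1912.",
     "Alan Turing was born in Maida Vale, London, on 23 June 1912."),
    ("Turing invented the telephone.",
     "Alan Turing was born in Maida Vale, London, on 23 June 1912."),
]
# Lowest-scoring pairs (least supported claims) come first.
for claim, source in sorted(pairs, key=lambda p: support_score(*p)):
    print(f"{support_score(claim, source):.2f}  {claim}")
```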
- The Web Is Your Oyster -- Knowledge-Intensive NLP against a Very Large Web Corpus [76.9522248303716]
We propose a new setup for evaluating existing KI-NLP tasks in which we generalize the background corpus to a universal web snapshot.
We repurpose KILT, a standard KI-NLP benchmark initially developed for Wikipedia, and ask systems to use a subset of CCNet - the Sphere corpus.
We find that despite potential gaps in coverage, challenges of scale, lack of structure and lower quality, retrieval from Sphere enables a state-of-the-art retrieve-and-read system to match and even outperform Wikipedia-based models.
arXiv Detail & Related papers (2021-12-18T13:15:34Z)
- Surfer100: Generating Surveys From Web Resources on Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined into a two-stage extractive and abstractive approach to Wikipedia lead-paragraph generation.
We extend this approach to generate longer Wikipedia-style summaries with sections and examine, through detailed studies with 100 reference human-collected surveys, where such methods struggle in this application (a generic two-stage sketch follows this entry).
arXiv Detail & Related papers (2021-12-13T02:18:01Z)
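A minimal sketch of a generic extract-then-abstract setup is shown below, assuming the Hugging Face transformers library and the facebook/bart-large-cnn checkpoint (both assumptions; the paper's exact models are not given in this summary). An extractive step picks salient sentences, and an abstractive model rewrites them into a lead-style paragraph.

```python
# Generic two-stage extractive + abstractive sketch (not the paper's models).
# Assumes: pip install transformers torch
from collections import Counter
from transformers import pipeline

def extract_top_sentences(text: str, k: int = 3) -> str:
    """Extractive stage: keep the k sentences with the most frequent content words."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freqs = Counter(w.lower() for s in sentences for w in s.split())
    scored = sorted(sentences,
                    key=lambda s: sum(freqs[w.lower()] for w in s.split()),
                    reverse=True)
    return ". ".join(scored[:k]) + "."

web_resources = (
    "Alan Turing was a British mathematician and computer scientist. "
    "He formalized the concepts of algorithm and computation with the Turing machine. "
    "Turing worked at Bletchley Park during the Second World War. "
    "He is widely considered to be the father of theoretical computer science."
)

# Abstractive stage: rewrite the extracted sentences into a short lead paragraph.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
extracted = extract_top_sentences(web_resources)
print(summarizer(extracted, max_length=60, min_length=15, do_sample=False)[0]["summary_text"])
```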
- Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks across Languages [60.00219873112454]
We investigate the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted.
Since Wikipedia is a central part of the web-based information landscape, such a dependence would indicate a language-related, linguistic bias.
The article builds a bridge between reading research, educational science, Wikipedia research and computational linguistics.
arXiv Detail & Related papers (2020-08-05T11:11:55Z)
- Architecture for a multilingual Wikipedia [0.0]
We argue that a new approach is needed to tackle the problem of uneven content coverage across Wikipedia's language editions more effectively.
This paper proposes an architecture for a system that fulfills this goal.
It separates the goal in two parts: creating and maintaining content in an abstract notation within a project called Abstract Wikipedia, and creating an infrastructure called Wikilambda that can translate this notation to natural language.
arXiv Detail & Related papers (2020-04-08T22:25:10Z)
- Entity Extraction from Wikipedia List Pages [2.3605348648054463]
We build a large taxonomy from categories and list pages with DBpedia as a backbone.
With distant supervision, we extract training data for the identification of new entities in list pages.
We extend DBpedia with 7.5M new type statements and 3.8M new facts of high precision.
arXiv Detail & Related papers (2020-03-11T07:48:46Z)
- WikiHist.html: English Wikipedia's Full Revision History in HTML Format [12.86558129722198]
We develop a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki.
We highlight the advantages of WikiHist.html over raw wikitext in an empirical analysis of Wikipedia's hyperlinks (a minimal HTML link-extraction sketch follows this entry).
arXiv Detail & Related papers (2020-01-28T10:44:43Z)
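To illustrate why HTML is convenient for this kind of link analysis, the snippet below pulls hyperlinks out of a small hand-written HTML fragment with BeautifulSoup (an assumed dependency); recovering the same links from raw wikitext would require expanding templates and macros first.

```python
# Minimal illustration of hyperlink extraction from rendered HTML, where
# anchors are explicit <a> tags. Requires: pip install beautifulsoup4
from bs4 import BeautifulSoup

html = '<p>See <a href="/wiki/Alan_Turing">Alan Turing</a> and ' \
       '<a href="/wiki/Enigma_machine">Enigma machine</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)  # ['/wiki/Alan_Turing', '/wiki/Enigma_machine']
```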
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.