Entity Extraction from Wikipedia List Pages
- URL: http://arxiv.org/abs/2003.05146v1
- Date: Wed, 11 Mar 2020 07:48:46 GMT
- Title: Entity Extraction from Wikipedia List Pages
- Authors: Nicolas Heist and Heiko Paulheim
- Abstract summary: We build a large taxonomy from categories and list pages with DBpedia as a backbone.
With distant supervision, we extract training data for the identification of new entities in list pages.
We extend DBpedia with 7.5M new type statements and 3.8M new facts of high precision.
- Score: 2.3605348648054463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When it comes to factual knowledge about a wide range of domains, Wikipedia
is often the prime source of information on the web. DBpedia and YAGO, as large
cross-domain knowledge graphs, encode a subset of that knowledge by creating an
entity for each page in Wikipedia, and connecting them through edges. It is
well known, however, that Wikipedia-based knowledge graphs are far from
complete. Especially, as Wikipedia's policies permit pages about subjects only
if they have a certain popularity, such graphs tend to lack information about
less well-known entities. Information about these entities is oftentimes
available in the encyclopedia, but not represented as an individual page. In
this paper, we present a two-phased approach for the extraction of entities
from Wikipedia's list pages, which have proven to serve as a valuable source of
information. In the first phase, we build a large taxonomy from categories and
list pages with DBpedia as a backbone. With distant supervision, we extract
training data for the identification of new entities in list pages that we use
in the second phase to train a classification model. With this approach we
extract over 700k new entities and extend DBpedia with 7.5M new type statements
and 3.8M new facts of high precision.
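As a rough illustration of the two-phase pipeline described in the abstract, the sketch below distantly labels list-page entries by comparing their existing DBpedia types against the type expected for the list page, then trains a classifier to score entries that have no DBpedia page yet. All helper names, features, example data, and the choice of scikit-learn's random forest are illustrative assumptions; the paper's actual taxonomy construction, feature set, and model are considerably richer.

```python
# Hypothetical sketch of the two-phase approach (not the authors' implementation).
from dataclasses import dataclass, field
from typing import List
from sklearn.ensemble import RandomForestClassifier


@dataclass
class Entry:
    """A hyperlinked entry of a Wikipedia list page."""
    mention: str
    linked_types: List[str] = field(default_factory=list)  # DBpedia types, if the link resolves
    position: int = 0


def expected_type(list_page: str) -> str:
    """Phase 1 stand-in: the taxonomy built from categories and list pages on a
    DBpedia backbone tells us which type the entries of a list page should have."""
    # Placeholder mapping; in the paper this comes from the learned taxonomy.
    return {"List of 20th-century women artists": "dbo:Artist"}.get(list_page, "owl:Thing")


def distant_label(entry: Entry, list_page: str) -> int:
    """Distant supervision: entries already carrying the expected type in DBpedia
    are positives; entries typed otherwise are treated as negatives."""
    return int(expected_type(list_page) in entry.linked_types)


def featurize(entry: Entry) -> List[float]:
    # Toy features only (position in the list, mention length).
    return [float(entry.position), float(len(entry.mention))]


# Phase 2: train on distantly labelled entries, then score an unlinked mention.
list_page = "List of 20th-century women artists"
known = [  # entries whose links resolve to existing DBpedia entities
    Entry("Frida Kahlo", ["dbo:Artist", "dbo:Person"], 0),
    Entry("Georgia O'Keeffe", ["dbo:Artist", "dbo:Person"], 1),
    Entry("MoMA", ["dbo:Museum"], 2),  # a linked row that is not an entity of the list
]
X = [featurize(e) for e in known]
y = [distant_label(e, list_page) for e in known]

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

unknown = Entry("Jane Example (painter)", [], 3)  # red link / missing entity
if clf.predict([featurize(unknown)])[0] == 1:
    print(f"'{unknown.mention}' is predicted to be a new dbo:Artist entity")
```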
Related papers
- Towards a Brazilian History Knowledge Graph [50.26735825937335]
We construct a knowledge graph for Brazilian history based on the Brazilian Dictionary of Historical Biographies (DHBB) and Wikipedia/Wikidata.
We show that many terms/entities described in the DHBB do not have corresponding concepts (or Q items) in Wikidata.
arXiv Detail & Related papers (2024-03-28T22:05:32Z)
- Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities [54.26896306906937]
We present OVEN-Wiki, where a model needs to link an image to a Wikipedia entity with respect to a text query.
We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that have never been seen during fine-tuning.
While PaLI-based models obtain higher overall performance, CLIP-based models are better at recognizing tail entities.
arXiv Detail & Related papers (2023-02-22T05:31:26Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Improving Wikipedia Verifiability with AI [116.69749668874493]
We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims.
For the top 10% of claims most likely to be unverifiable, the system's first citation recommendation collects over 60% more preferences than the existing Wikipedia citations.
Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia.
arXiv Detail & Related papers (2022-07-08T15:23:29Z)
- Surfer100: Generating Surveys From Web Resources on Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation.
We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
arXiv Detail & Related papers (2021-12-13T02:18:01Z)
- A Map of Science in Wikipedia [0.22843885788439797]
We map the relationship between Wikipedia articles and scientific journal articles.
Most journal articles cited from Wikipedia belong to STEM fields, in particular biology and medicine.
Wikipedia's biographies play an important role in connecting STEM fields with the humanities, especially history.
arXiv Detail & Related papers (2021-10-26T15:44:32Z)
- What if we had no Wikipedia? Domain-independent Term Extraction from a Large News Corpus [9.081222401894552]
We aim to identify "wiki-worthy" terms in a massive news corpus, and see if this can be done with no, or minimal, dependency on actual Wikipedia entries.
Our work sheds new light on the domain-specific Automatic Term Extraction problem, with the problem at hand being a domain-independent variant of it.
arXiv Detail & Related papers (2020-09-17T12:45:46Z)
- Quantifying Engagement with Citations on Wikipedia [13.703047949952852]
One in 300 page views results in a reference click.
Clicks occur more frequently on shorter pages and on pages of lower quality.
Recent content, open access sources and references about life events are particularly popular.
arXiv Detail & Related papers (2020-01-23T15:52:36Z)
- Classifying Wikipedia in a fine-grained hierarchy: what graphs can contribute [0.5530212768657543]
We address the task of integrating graph (i.e. structure) information to classify Wikipedia into a fine-grained named entity (NE) ontology.
We conduct at-scale practical experiments on a manually labeled subset of 22,000 pages extracted from the Japanese Wikipedia.
Our results show that integrating graph information reduces the sparsity of the input feature space and yields classification results that are comparable to or better than previous work.
arXiv Detail & Related papers (2020-01-21T14:19:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.