WikiHist.html: English Wikipedia's Full Revision History in HTML Format
- URL: http://arxiv.org/abs/2001.10256v3
- Date: Tue, 21 Apr 2020 17:21:28 GMT
- Title: WikiHist.html: English Wikipedia's Full Revision History in HTML Format
- Authors: Blagoj Mitrevski, Tiziano Piccardi, Robert West
- Abstract summary: We develop a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki.
We highlight the advantages of WikiHist.html over raw wikitext in an empirical analysis of Wikipedia's hyperlinks.
- Score: 12.86558129722198
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Wikipedia is written in the wikitext markup language. When serving content,
the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby
inserting additional content by expanding macros (templates and modules).
Hence, researchers who intend to analyze Wikipedia as seen by its readers should
work with HTML, rather than wikitext. Since Wikipedia's revision history is
publicly available exclusively in wikitext format, researchers have had to
produce HTML themselves, typically by using Wikipedia's REST API for ad-hoc
wikitext-to-HTML parsing. This approach, however, (1) does not scale to very
large amounts of data and (2) does not correctly expand macros in historical
article revisions. We solve these problems by developing a parallelized
architecture for parsing massive amounts of wikitext using local instances of
MediaWiki, enhanced with the capacity of correct historical macro expansion. By
deploying our system, we produce and release WikiHist.html, English Wikipedia's
full revision history in HTML format. We highlight the advantages of
WikiHist.html over raw wikitext in an empirical analysis of Wikipedia's
hyperlinks, showing that over half of the wiki links present in HTML are
missing from raw wikitext and that the missing links are important for user
navigation.
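As a rough illustration of the two points above, the sketch below parses a wikitext snippet to HTML via the public Wikimedia REST API (the ad-hoc approach the abstract contrasts with the paper's local MediaWiki pipeline) and then compares the links visible in raw wikitext with those present in the parsed HTML. The endpoint path, form parameter, and helper names are assumptions based on the public API documentation, not taken from the paper, and this is not the authors' parallelized architecture.

```python
import re
import requests
from html.parser import HTMLParser

# Hypothetical sketch: templates such as {{Main|...}} expand to anchors only
# after parsing, which is why HTML exposes links that raw wikitext does not.
# Endpoint and payload are assumptions based on the public Wikimedia REST API.
REST_ENDPOINT = "https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html"
HEADERS = {"User-Agent": "wikihist-html-demo/0.1 (research sketch)"}  # per Wikimedia etiquette


class LinkCollector(HTMLParser):
    """Collect href targets of <a> tags from parsed HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


def links_in_wikitext(wikitext: str) -> set:
    # Only explicit [[target|label]] links are visible in raw wikitext.
    return {m.split("|")[0].strip() for m in re.findall(r"\[\[([^\]]+)\]\]", wikitext)}


def links_in_html(wikitext: str) -> set:
    # Ad-hoc wikitext-to-HTML parsing through the REST API, then link extraction.
    resp = requests.post(
        REST_ENDPOINT, data={"wikitext": wikitext}, headers=HEADERS, timeout=30
    )
    resp.raise_for_status()
    collector = LinkCollector()
    collector.feed(resp.text)
    # Parsoid hrefs look like "./Target"; strip the relative prefix for comparison.
    return {href.lstrip("./") for href in collector.hrefs}


if __name__ == "__main__":
    sample = "{{Main|History of Wikipedia}} See also [[MediaWiki]]."
    wt_links = links_in_wikitext(sample)
    html_links = links_in_html(sample)
    print("links in raw wikitext:", wt_links)              # only the explicit link
    print("extra links in HTML:", html_links - wt_links)   # template-generated links
```

Run against a template-heavy snippet, the set difference is non-empty, which mirrors the abstract's finding that a large share of wiki links only materialize after macro expansion.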
Related papers
- WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset [48.00110675968677]
We introduce the Wikipedia Webpage 2M (WikiWeb2M) suite; the first to retain the full set of images, text, and structure data available in a page.
WikiWeb2M can be used for tasks like page description generation, section summarization, and contextual image captioning.
arXiv Detail & Related papers (2023-05-09T13:20:59Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Improving Wikipedia Verifiability with AI [116.69749668874493]
We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims.
Our first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the same top 10% most likely unverifiable claims.
Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia.
arXiv Detail & Related papers (2022-07-08T15:23:29Z)
- Wikidated 1.0: An Evolving Knowledge Graph Dataset of Wikidata's Revision History [5.727994421498849]
We present Wikidated 1.0, a dataset of Wikidata's full revision history.
To the best of our knowledge, it constitutes the first large dataset of an evolving knowledge graph.
arXiv Detail & Related papers (2021-12-09T15:54:03Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [4.148821165759295]
We build the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues.
To build this dataset, we rely on Wikipedia "templates".
We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative.
arXiv Detail & Related papers (2021-05-10T05:07:03Z)
- Generating Wikipedia Article Sections from Diverse Data Sources [57.23574577984244]
We benchmark several training and decoding strategies on WikiTableT.
Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they sometimes struggle with coherence.
arXiv Detail & Related papers (2020-12-29T19:35:34Z)
- Analyzing Wikidata Transclusion on English Wikipedia [1.5736899098702972]
This work presents a taxonomy of Wikidata transclusion and an analysis of Wikidata transclusion within English Wikipedia.
It finds that Wikidata transclusion that impacts the content of Wikipedia articles happens at a much lower rate (5%) than previous statistics had suggested (61%).
arXiv Detail & Related papers (2020-11-02T14:16:42Z)
- Entity Extraction from Wikipedia List Pages [2.3605348648054463]
We build a large taxonomy from categories and list pages with DBpedia as a backbone.
With distant supervision, we extract training data for the identification of new entities in list pages.
We extend DBpedia with 7.5M new type statements and 3.8M new facts of high precision.
arXiv Detail & Related papers (2020-03-11T07:48:46Z)