Wiki-Reliability: A Large Scale Dataset for Content Reliability on
Wikipedia
- URL: http://arxiv.org/abs/2105.04117v1
- Date: Mon, 10 May 2021 05:07:03 GMT
- Title: Wiki-Reliability: A Large Scale Dataset for Content Reliability on
Wikipedia
- Authors: KayYen Wong, Miriam Redi, Diego Saez-Trumper
- Abstract summary: We build the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues.
To build this dataset, we rely on Wikipedia "templates".
We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative.
- Score: 4.148821165759295
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Wikipedia is the largest online encyclopedia, used by algorithms and web
users as a central hub of reliable information on the web. The quality and
reliability of Wikipedia content are maintained by a community of volunteer
editors. Machine learning and information retrieval algorithms could help scale
up editors' manual efforts around Wikipedia content reliability. However, there
is a lack of large-scale data to support the development of such research. To
fill this gap, in this paper, we propose Wiki-Reliability, the first dataset of
English Wikipedia articles annotated with a wide set of content reliability
issues. To build this dataset, we rely on Wikipedia "templates". Templates are
tags used by expert Wikipedia editors to indicate content issues, such as the
presence of "non-neutral point of view" or "contradictory articles", and serve
as a strong signal for detecting reliability issues in a revision. We select
the 10 most popular reliability-related templates on Wikipedia, and propose an
effective method to label almost 1M samples of Wikipedia article revisions as
positive or negative with respect to each template. Each positive/negative
example in the dataset comes with the full article text and 20 features from
the revision's metadata. We provide an overview of the possible downstream
tasks enabled by such data, and show that Wiki-Reliability can be used to train
large-scale models for content reliability prediction. We release all data and
code for public use.
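The abstract sketches the labeling idea (a reliability-related template present vs. absent in a revision) but not the extraction step itself. The snippet below is a minimal Python sketch of how a revision's wikitext could be scanned for such templates to assign a coarse positive/negative label; the template names, regex, and function names are illustrative assumptions, not the authors' released pipeline, which also attaches the full article text and 20 metadata features to each example.
```python
import re
from typing import Iterable, Iterator, Tuple

# Illustrative template names only -- the actual 10 reliability-related
# templates are defined in the released Wiki-Reliability dataset.
RELIABILITY_TEMPLATES = {"pov", "unreliable sources", "contradict"}

def has_reliability_template(wikitext: str) -> bool:
    """Return True if the revision's wikitext transcludes any target template."""
    # Transclusions look like {{Template name|...}}; normalise the name
    # (lowercase, underscores to spaces) before matching.
    for match in re.finditer(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})", wikitext):
        name = match.group(1).strip().lower().replace("_", " ")
        if name in RELIABILITY_TEMPLATES:
            return True
    return False

def label_revisions(history: Iterable[Tuple[int, str]]) -> Iterator[Tuple[int, int]]:
    """Yield (revision_id, label): 1 if a reliability template is present, else 0."""
    for rev_id, wikitext in history:
        yield rev_id, int(has_reliability_template(wikitext))
```
A single-snapshot presence check like this is only a rough proxy; a fuller pipeline would label revisions with respect to each template individually and track template additions and removals across the revision history.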
Related papers
- Publishing Wikipedia usage data with strong privacy guarantees [6.410779699541235]
For almost 20 years, the Wikimedia Foundation has been publishing statistics about how many people visited each Wikipedia page on each day.
In June 2023, the Wikimedia Foundation started publishing these statistics with finer granularity, including the country of origin in the daily counts of page views.
This paper describes this data publication: its goals, the process followed from its inception to its deployment, and the outcomes of the data release.
arXiv Detail & Related papers (2023-08-30T19:58:56Z) - Orphan Articles: The Dark Matter of Wikipedia [13.290424502717734]
We conduct the first systematic study of orphan articles, which are articles without any incoming links from other Wikipedia articles.
We find that a surprisingly large share of content, roughly 15% (8.8M) of all articles, is de facto invisible to readers navigating Wikipedia.
We also provide causal evidence, through a quasi-experiment, that adding new incoming links to orphans (de-orphanization) leads to a statistically significant increase in their visibility.
arXiv Detail & Related papers (2023-06-06T18:04:33Z) - WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in
Wikipedia [14.325320851640084]
We propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia.
Each sentence is extracted from the entire revision history of English Wikipedia.
WikiSQE has about 3.4M sentences with 153 quality labels.
arXiv Detail & Related papers (2023-05-10T06:45:13Z) - Mapping Process for the Task: Wikidata Statements to Text as Wikipedia
Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z) - WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions
from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z) - Improving Wikipedia Verifiability with AI [116.69749668874493]
We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims.
Our system's first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the top 10% of claims most likely to be unverifiable.
Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia.
arXiv Detail & Related papers (2022-07-08T15:23:29Z) - Surfer100: Generating Surveys From Web Resources, Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation.
We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
arXiv Detail & Related papers (2021-12-13T02:18:01Z) - Assessing the quality of sources in Wikidata across languages: a hybrid
approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z) - Generating Wikipedia Article Sections from Diverse Data Sources [57.23574577984244]
We benchmark several training and decoding strategies on WikiTableT.
Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they sometimes struggle with coherence.
arXiv Detail & Related papers (2020-12-29T19:35:34Z) - Scalable Recommendation of Wikipedia Articles to Editors Using
Representation Learning [1.8810916321241067]
We develop a scalable system on top of Graph Convolutional Networks and Doc2Vec, learning how to represent Wikipedia articles and deliver personalized recommendations for editors.
We test our model on editors' histories, predicting their most recent edits based on their prior edits.
All of the data used in this paper is publicly available, including graph embeddings for Wikipedia articles, and we release our code to support replication of our experiments.
arXiv Detail & Related papers (2020-09-24T15:56:02Z) - WikiHist.html: English Wikipedia's Full Revision History in HTML Format [12.86558129722198]
We develop a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki.
We highlight the advantages of WikiHist.html over raw wikitext in an empirical analysis of Wikipedia's hyperlinks.
arXiv Detail & Related papers (2020-01-28T10:44:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all of the above) and is not responsible for any consequences arising from its use.