Wikipedia Reader Navigation: When Synthetic Data Is Enough
- URL: http://arxiv.org/abs/2201.00812v2
- Date: Wed, 5 Jan 2022 17:46:04 GMT
- Title: Wikipedia Reader Navigation: When Synthetic Data Is Enough
- Authors: Akhil Arora, Martin Gerlach, Tiziano Piccardi, Alberto García-Durán, Robert West
- Abstract summary: We quantify the differences between real navigation sequences and synthetic sequences generated from the clickstream data.
We find that the differences between real and synthetic sequences are statistically significant, but with small effect sizes, often well below 10%.
This constitutes quantitative evidence for the utility of the Wikipedia clickstream data as a public resource.
- Score: 11.99768070409472
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Every day millions of people read Wikipedia. When navigating the vast space
of available topics using hyperlinks, readers describe trajectories on the
article network. Understanding these navigation patterns is crucial to better
serve readers' needs and address structural biases and knowledge gaps. However,
systematic studies of navigation on Wikipedia are hindered by a lack of
publicly available data due to the commitment to protect readers' privacy by
not storing or sharing potentially sensitive data. In this paper, we ask: How
well can Wikipedia readers' navigation be approximated by using publicly
available resources, most notably the Wikipedia clickstream data? We
systematically quantify the differences between real navigation sequences and
synthetic sequences generated from the clickstream data, in 6 analyses across 8
Wikipedia language versions. Overall, we find that the differences between real
and synthetic sequences are statistically significant, but with small effect
sizes, often well below 10%. This constitutes quantitative evidence for the
utility of the Wikipedia clickstream data as a public resource: clickstream
data can closely capture reader navigation on Wikipedia and provides a
sufficient approximation for most practical downstream applications relying on
reader data. More broadly, this study provides an example for how
clickstream-like data can generally enable research on user navigation on
online platforms while protecting users' privacy.
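The core idea of the paper can be illustrated with a small sketch. The public clickstream dataset publishes (referrer, target, count) triples; treating those counts as first-order Markov transition weights yields synthetic navigation sequences of the kind the paper compares against real reader sessions. This is an illustrative assumption, not the authors' released code; the article names and helper functions below are hypothetical.

```python
import random
from collections import defaultdict

def build_transitions(clickstream):
    """Map each referrer article to its target articles and transition weights."""
    transitions = defaultdict(list)
    for referrer, target, count in clickstream:
        transitions[referrer].append((target, count))
    return transitions

def sample_sequence(transitions, start, max_len=10, rng=random):
    """Random-walk a synthetic reading session from `start`, following
    clickstream counts as transition probabilities; stops at dead ends."""
    sequence = [start]
    current = start
    while len(sequence) < max_len and current in transitions:
        targets, weights = zip(*transitions[current])
        current = rng.choices(targets, weights=weights, k=1)[0]
        sequence.append(current)
    return sequence

# Toy clickstream: (referrer, target, count) triples.
clickstream = [
    ("Wikipedia", "Hyperlink", 120),
    ("Wikipedia", "Encyclopedia", 80),
    ("Hyperlink", "World_Wide_Web", 50),
]
print(sample_sequence(build_transitions(clickstream), "Wikipedia",
                      rng=random.Random(0)))
```

Because the clickstream aggregates transitions over all readers, sequences sampled this way lose within-session correlations; the paper's contribution is measuring how much that loss matters in practice.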
Related papers
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Improving Wikipedia Verifiability with AI [116.69749668874493]
We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims.
For the top 10% of claims most likely to be unverifiable, the system's first citation recommendation collects over 60% more preferences than the existing Wikipedia citations.
Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia.
arXiv Detail & Related papers (2022-07-08T15:23:29Z)
- A Large-Scale Characterization of How Readers Browse Wikipedia [13.106604261718381]
We present the first systematic large-scale analysis of how readers browse Wikipedia.
Using billions of page requests from Wikipedia's server logs, we measure how readers reach articles.
We find that navigation behavior is characterized by highly diverse structures.
arXiv Detail & Related papers (2021-12-22T12:54:44Z)
- Surfer100: Generating Surveys From Web Resources on Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined in a two-stage extractive and abstractive approach to Wikipedia lead paragraph generation.
We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
arXiv Detail & Related papers (2021-12-13T02:18:01Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [4.148821165759295]
We build the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues.
To build this dataset, we rely on Wikipedia "templates".
We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative.
arXiv Detail & Related papers (2021-05-10T05:07:03Z)
- Ranking the information content of distance measures [61.754016309475745]
We introduce a statistical test that can assess the relative information retained when using two different distance measures.
This in turn allows finding the most informative distance measure out of a pool of candidates.
arXiv Detail & Related papers (2021-04-30T15:57:57Z)
- Tracking Knowledge Propagation Across Wikipedia Languages [1.8447697408534176]
We present a dataset of inter-language knowledge propagation in Wikipedia.
The dataset covers the entire 309 language editions and 33M articles.
We find that the size of language editions is associated with the speed of propagation.
arXiv Detail & Related papers (2021-03-30T18:36:13Z)
- How Inclusive Are Wikipedia's Hyperlinks in Articles Covering Polarizing Topics? [8.035521056416242]
We focus on the influence of the interconnect topology between articles describing complementary aspects of polarizing topics.
We introduce a novel measure of exposure to diverse information to quantify users' exposure to different aspects of a topic.
We identify cases in which the network topology significantly limits the exposure of users to diverse information on the topic, encouraging users to remain in a knowledge bubble.
arXiv Detail & Related papers (2020-07-16T09:19:57Z)
- Learning to Summarize Passages: Mining Passage-Summary Pairs from Wikipedia Revision Histories [110.54963847339775]
We propose a method for automatically constructing a passage-to-summary dataset by mining the Wikipedia page revision histories.
In particular, the method mines the main body passages and the introduction sentences which are added to the pages simultaneously.
The constructed dataset contains more than one hundred thousand passage-summary pairs.
arXiv Detail & Related papers (2020-04-06T12:11:50Z)
- Quantifying Engagement with Citations on Wikipedia [13.703047949952852]
One in 300 page views results in a reference click.
Clicks occur more frequently on shorter pages and on pages of lower quality.
Recent content, open access sources and references about life events are particularly popular.
arXiv Detail & Related papers (2020-01-23T15:52:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.