Publishing Wikipedia usage data with strong privacy guarantees
- URL: http://arxiv.org/abs/2308.16298v2
- Date: Fri, 1 Sep 2023 18:07:46 GMT
- Title: Publishing Wikipedia usage data with strong privacy guarantees
- Authors: Temilola Adeleye, Skye Berghel, Damien Desfontaines, Michael Hay, Isaac Johnson, Cléo Lemoisson, Ashwin Machanavajjhala, Tom Magerlein, Gabriele Modena, David Pujol, Daniel Simmons-Marengo, Hal Triedman
- Abstract summary: For almost 20 years, the Wikimedia Foundation has been publishing statistics about how many people visited each Wikipedia page on each day.
In June 2023, the Wikimedia Foundation started publishing these statistics with finer granularity, including the country of origin in the daily counts of page views.
This paper describes this data publication: its goals, the process followed from its inception to its deployment, and the outcomes of the data release.
- Score: 6.410779699541235
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For almost 20 years, the Wikimedia Foundation has been publishing statistics about how many people visited each Wikipedia page on each day. This data helps Wikipedia editors determine where to focus their efforts to improve the online encyclopedia, and enables academic research. In June 2023, the Wikimedia Foundation, helped by Tumult Labs, addressed a long-standing request from Wikipedia editors and academic researchers: it started publishing these statistics with finer granularity, including the country of origin in the daily counts of page views. This new data publication uses differential privacy to provide robust guarantees to people browsing or editing Wikipedia. This paper describes this data publication: its goals, the process followed from its inception to its deployment, the algorithms used to produce the data, and the outcomes of the data release.
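The abstract states that the release relies on differential privacy but does not spell out the mechanism here. The Python sketch below is only a generic illustration of differentially private count publication (a Laplace mechanism with a suppression threshold); the function name, parameters, sensitivity bound, and threshold are assumptions for illustration, not the Wikimedia Foundation's actual pipeline.

```python
import numpy as np

def dp_release_counts(true_counts, epsilon, threshold=None, seed=None):
    """Illustrative differentially private release of (page, country, day) view
    counts via the Laplace mechanism. A generic sketch, not the mechanism
    actually deployed by the Wikimedia Foundation."""
    rng = np.random.default_rng(seed)
    released = {}
    for key, count in true_counts.items():
        # Assumption: each person contributes at most one view per key, so the
        # L1 sensitivity of each count is 1 and the noise scale is 1/epsilon.
        noisy = count + rng.laplace(loc=0.0, scale=1.0 / epsilon)
        # Optionally drop small noisy counts so that noise-only rows are suppressed.
        if threshold is None or noisy >= threshold:
            released[key] = round(noisy)
    return released

# Hypothetical daily counts keyed by (page, country).
views = {("Differential_privacy", "US"): 1523, ("Differential_privacy", "FR"): 310}
print(dp_release_counts(views, epsilon=1.0, threshold=90, seed=0))
```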
Related papers
- A Test of Time: Predicting the Sustainable Success of Online Collaboration in Wikipedia [17.051622145253855]
We develop machine learning models to predict the sustainable success of Wikipedia articles.
We find that the longer an article takes to be recognized as high-quality, the more likely it is to maintain that status over time.
Our analysis provides insights into broader collective actions beyond Wikipedia.
arXiv Detail & Related papers (2024-10-24T20:42:53Z)
- Surveying Wikipedians: a dataset of users and contributors' practices on Wikipedia in 8 languages [0.0]
The dataset focuses on Wikipedia users and contains information about the demographic and socioeconomic characteristics of the respondents.
The data was collected using a questionnaire available online between June and July 2023.
arXiv Detail & Related papers (2023-11-14T07:39:27Z)
- Leveraging Wikidata's edit history in knowledge graph refinement tasks [77.34726150561087]
The edit history represents the process by which the community reaches some kind of fuzzy and distributed consensus.
We build a dataset containing the edit history of every instance from the 100 most important classes in Wikidata.
We propose and evaluate two new methods to leverage this edit history information in knowledge graph embedding models for type prediction tasks.
arXiv Detail & Related papers (2022-10-27T14:32:45Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Improving Wikipedia Verifiability with AI [116.69749668874493]
We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims.
Our first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the same top 10% most likely unverifiable claims.
Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia.
arXiv Detail & Related papers (2022-07-08T15:23:29Z)
- Wikipedia Reader Navigation: When Synthetic Data Is Enough [11.99768070409472]
We quantify the differences between real navigation sequences and synthetic sequences generated from the clickstream data.
We find that the differences between real and synthetic sequences are statistically significant, but with small effect sizes, often well below 10%.
This constitutes quantitative evidence for the utility of the Wikipedia clickstream data as a public resource.
arXiv Detail & Related papers (2022-01-03T18:58:39Z)
- Surfer100: Generating Surveys From Web Resources on Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined in a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation.
We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
arXiv Detail & Related papers (2021-12-13T02:18:01Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [4.148821165759295]
We build the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues.
To build this dataset, we rely on Wikipedia "templates".
We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative.
arXiv Detail & Related papers (2021-05-10T05:07:03Z)
- Quantifying Engagement with Citations on Wikipedia [13.703047949952852]
One in 300 page views results in a reference click.
Clicks occur more frequently on shorter pages and on pages of lower quality.
Recent content, open access sources and references about life events are particularly popular.
arXiv Detail & Related papers (2020-01-23T15:52:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.