Surveying Wikipedians: a dataset of users and contributors' practices on
Wikipedia in 8 languages
- URL: http://arxiv.org/abs/2311.07964v2
- Date: Tue, 5 Dec 2023 09:04:12 GMT
- Title: Surveying Wikipedians: a dataset of users and contributors' practices on
Wikipedia in 8 languages
- Authors: Caterina Cruciani, Léo Joubert (LEST, DySoLab), Nicolas Jullien (IMT
Atlantique - LUSSI, MARSOUIN, LEGO), Laurent Mell (CREAD, MARSOUIN), Sasha
Piccione, Jeanne Vermeirsche (AU)
- Abstract summary: The dataset focuses on Wikipedia users and contains information about the demographic and socioeconomic characteristics of the respondents.
The data was collected using a questionnaire available online between June and July 2023.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The dataset focuses on Wikipedia users and contains information about
demographic and socioeconomic characteristics of the respondents and their
activity on Wikipedia. The data was collected using a questionnaire available
online between June and July 2023. The link to the questionnaire was
distributed via a banner published in 8 languages on the Wikipedia page.
Filling out the questionnaire was voluntary and not incentivised in any way.
The survey includes 200 questions about: what people were doing on Wikipedia
before clicking the link to the questionnaire; how they use Wikipedia as
readers ("professional" and "personal" uses); their opinion on the quality,
the thematic coverage, the importance of the encyclopaedia; the making of
Wikipedia (how they think it is made, if they have ever contributed and how);
their social, sport, artistic and cultural activities, both online and offline;
their socio-economic characteristics including political beliefs, and trust
propensities. More than 200 000 people opened the questionnaire; the 100 332
who started to answer constitute our dataset, and 10 576 finished it. Beyond
other themes that future researchers may identify, the dataset can help
advance research on the characteristics of readers versus contributors of
online commons, and on the relationship between trust, information, sources,
and the use made of this information.
Related papers
- How to Engage Your Readers? Generating Guiding Questions to Promote Active Reading [60.19226384241482]
We introduce GuidingQ, a dataset of 10K in-text questions from textbooks and scientific articles.
We explore various approaches to generate such questions using language models.
We conduct a human study to understand the implication of such questions on reading comprehension.
arXiv Detail & Related papers (2024-07-19T13:42:56Z)
- Publishing Wikipedia usage data with strong privacy guarantees [6.410779699541235]
For almost 20 years, the Wikimedia Foundation has been publishing statistics about how many people visited each Wikipedia page on each day.
In June 2023, the Wikimedia Foundation started publishing these statistics with finer granularity, including the country of origin in the daily counts of page views.
This paper describes this data publication: its goals, the process followed from its inception to its deployment, and the outcomes of the data release.
arXiv Detail & Related papers (2023-08-30T19:58:56Z)
- Wiki-based Communities of Interest: Demographics and Outliers [18.953455338226103]
Identified from Wiki-based sources, the data covers 7.5k communities, such as members of the White House Coronavirus Task Force.
We release subject-centric and group-centric datasets in format, as well as a browsing interface.
arXiv Detail & Related papers (2023-03-16T09:58:11Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Wikipedia Reader Navigation: When Synthetic Data Is Enough [11.99768070409472]
We quantify the differences between real navigation sequences and synthetic sequences generated from the clickstream data.
We find that the differences between real and synthetic sequences are statistically significant, but with small effect sizes, often well below 10%.
This constitutes quantitative evidence for the utility of the Wikipedia clickstream data as a public resource.
arXiv Detail & Related papers (2022-01-03T18:58:39Z)
- Surfer100: Generating Surveys From Web Resources on Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation.
We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
arXiv Detail & Related papers (2021-12-13T02:18:01Z)
- A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers [66.11048565324468]
We present a dataset of 5,049 questions over 1,585 Natural Language Processing papers.
Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text.
We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers.
arXiv Detail & Related papers (2021-05-07T00:12:34Z)
- Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks across Languages [60.00219873112454]
We investigate the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted.
Since Wikipedia is a central part of the web-based information landscape, this indicates a language-related, linguistic bias.
The article builds a bridge between reading research, educational science, Wikipedia research and computational linguistics.
arXiv Detail & Related papers (2020-08-05T11:11:55Z)
- How Inclusive Are Wikipedia's Hyperlinks in Articles Covering Polarizing Topics? [8.035521056416242]
We focus on the influence of the interconnect topology between articles describing complementary aspects of polarizing topics.
We introduce a novel measure of exposure to diverse information to quantify users' exposure to different aspects of a topic.
We identify cases in which the network topology significantly limits the exposure of users to diverse information on the topic, encouraging users to remain in a knowledge bubble.
arXiv Detail & Related papers (2020-07-16T09:19:57Z)
- Quantifying Engagement with Citations on Wikipedia [13.703047949952852]
One in 300 page views results in a reference click.
Clicks occur more frequently on shorter pages and on pages of lower quality.
Recent content, open access sources and references about life events are particularly popular.
arXiv Detail & Related papers (2020-01-23T15:52:36Z)