Tracking Knowledge Propagation Across Wikipedia Languages
- URL: http://arxiv.org/abs/2103.16613v1
- Date: Tue, 30 Mar 2021 18:36:13 GMT
- Title: Tracking Knowledge Propagation Across Wikipedia Languages
- Authors: Rodolfo Valentim, Giovanni Comarela, Souneil Park and Diego
Saez-Trumper
- Abstract summary: We present a dataset of inter-language knowledge propagation in Wikipedia.
The dataset covers all 309 language editions and 33M articles.
We find that the size of language editions is associated with the speed of propagation.
- Score: 1.8447697408534176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a dataset of inter-language knowledge propagation
in Wikipedia. Covering all 309 language editions and 33M articles, the
dataset aims to track the full propagation history of Wikipedia concepts and to
allow follow-up research on building predictive models of them. For this
purpose, we align all the Wikipedia articles in a language-agnostic manner
according to the concept they cover, which results in 13M propagation
instances. To the best of our knowledge, this dataset is the first to explore
the full inter-language propagation at a large scale. Together with the
dataset, a holistic overview of the propagation and key insights about the
underlying structural factors are provided to aid future research. For example,
we find that although long cascades are unusual, the propagation tends to
continue further once it reaches more than four language editions. We also find
that the size of language editions is associated with the speed of propagation.
We believe the dataset not only contributes to the prior literature on
Wikipedia growth but also enables new use cases such as edit recommendation for
addressing knowledge gaps, detection of disinformation, and cultural
relationship analysis.
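The propagation instances described above can be read as ordered sequences of article creations for the same concept across language editions. As a rough illustration only (not the authors' released code, and using hypothetical creation records keyed by Wikidata concept IDs), a cascade and its per-edition delays might be reconstructed like this:

```python
# Illustrative sketch: rebuild a propagation "cascade" for one concept from
# per-language article creation timestamps, mirroring the language-agnostic
# alignment the abstract describes. All records below are hypothetical.
from collections import defaultdict
from datetime import datetime

# (Wikidata concept ID, language edition, article creation time) -- hypothetical
records = [
    ("Q42", "en", datetime(2001, 10, 28)),
    ("Q42", "de", datetime(2002, 5, 3)),
    ("Q42", "fr", datetime(2003, 1, 15)),
    ("Q42", "pt", datetime(2004, 7, 9)),
]

def build_cascades(records):
    """Group articles by concept and order each group by creation time."""
    by_concept = defaultdict(list)
    for qid, lang, created in records:
        by_concept[qid].append((created, lang))
    return {qid: sorted(entries) for qid, entries in by_concept.items()}

def propagation_delays(cascade):
    """Elapsed time between consecutive language editions covering a concept."""
    return [(lang, later - earlier)
            for (earlier, _), (later, lang) in zip(cascade, cascade[1:])]

for qid, cascade in build_cascades(records).items():
    print(qid, "reached", len(cascade), "language editions")
    for lang, delay in propagation_delays(cascade):
        print(f"  +{delay.days} days: {lang}")
```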
Related papers
- Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia [49.80565462746646]
We introduce the InfoGap method -- an efficient and reliable approach to locating information gaps and inconsistencies in articles at the fact level.
We evaluate InfoGap by analyzing LGBT people's portrayals, across 2.7K biography pages on English, Russian, and French Wikipedias.
arXiv Detail & Related papers (2024-10-05T20:40:49Z)
- An Open Multilingual System for Scoring Readability of Wikipedia [3.992677070507323]
We develop a multilingual model to score the readability of Wikipedia articles.
We create a novel multilingual dataset spanning 14 languages, by matching articles from Wikipedia to simplified Wikipedia and online children's encyclopedias.
We show that our model performs well in a zero-shot scenario, yielding a ranking accuracy of more than 80% across 14 languages.
arXiv Detail & Related papers (2024-06-03T23:07:18Z)
- Curious Rhythms: Temporal Regularities of Wikipedia Consumption [15.686850035802667]
We show that even after removing the global pattern of day-night alternation, the consumption habits of individual articles maintain strong diurnal regularities.
We investigate topical and contextual correlates of Wikipedia articles' access rhythms, finding that article topic, reader country, and access device (mobile vs. desktop) are all important predictors of daily attention patterns.
arXiv Detail & Related papers (2023-05-16T14:48:08Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Considerations for Multilingual Wikipedia Research [1.5736899098702972]
Growing use of non-English language editions of Wikipedia has led to the inclusion of many more language editions in datasets and models.
This paper seeks to provide some background to help researchers think about what differences might arise between different language editions of Wikipedia.
arXiv Detail & Related papers (2022-04-05T20:34:15Z)
- Surfer100: Generating Surveys From Web Resources on Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation.
We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
arXiv Detail & Related papers (2021-12-13T02:18:01Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks across Languages [60.00219873112454]
We investigate the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted.
Since Wikipedia is a central part of the web-based information landscape, this indicates a language-related, linguistic bias.
The article builds a bridge between reading research, educational science, Wikipedia research and computational linguistics.
arXiv Detail & Related papers (2020-08-05T11:11:55Z)
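The last related paper above quantifies (dis-)similarities of knowledge networks across language editions. For illustration only (this is not that paper's method), one simple overlap measure is the Jaccard similarity of the outgoing links of the same concept's article in two editions, after mapping links to language-independent identifiers; the link sets below are hypothetical:

```python
# Toy sketch: compare two language editions' link sets for the same concept.
# Links are given as Wikidata QIDs so the comparison is language-independent.
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical outgoing links of the same concept's article in two editions.
links_en = {"Q1", "Q2", "Q3", "Q4"}
links_de = {"Q2", "Q3", "Q5"}

print(f"overlap: {jaccard(links_en, links_de):.2f}")            # 0.40
print(f"dissimilarity: {1 - jaccard(links_en, links_de):.2f}")  # 0.60
```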
This list is automatically generated from the titles and abstracts of the papers in this site.