Orphan Articles: The Dark Matter of Wikipedia
- URL: http://arxiv.org/abs/2306.03940v2
- Date: Sat, 05 Oct 2024 14:17:15 GMT
- Title: Orphan Articles: The Dark Matter of Wikipedia
- Authors: Akhil Arora, Robert West, Martin Gerlach,
- Abstract summary: We conduct the first systematic study of orphan articles, which are articles without any incoming links from other Wikipedia articles.
We find that a surprisingly large extent of content, roughly 15% (8.8M) of all articles, is de facto invisible to readers navigating Wikipedia.
We also provide causal evidence through a quasi-experiment that adding new incoming links to orphans (de-orphanization) leads to a statistically significant increase of their visibility.
- Score: 13.290424502717734
- License:
- Abstract: With 60M articles in more than 300 language versions, Wikipedia is the largest platform for open and freely accessible knowledge. While the available content has been growing continuously at a rate of around 200K new articles each month, very little attention has been paid to the accessibility of the content. One crucial aspect of accessibility is the integration of hyperlinks into the network so the articles are visible to readers navigating Wikipedia. In order to understand this phenomenon, we conduct the first systematic study of orphan articles, which are articles without any incoming links from other Wikipedia articles, across 319 different language versions of Wikipedia. We find that a surprisingly large extent of content, roughly 15\% (8.8M) of all articles, is de facto invisible to readers navigating Wikipedia, and thus, rightfully term orphan articles as the dark matter of Wikipedia. We also provide causal evidence through a quasi-experiment that adding new incoming links to orphans (de-orphanization) leads to a statistically significant increase of their visibility in terms of the number of pageviews. We further highlight the challenges faced by editors for de-orphanizing articles, demonstrate the need to support them in addressing this issue, and provide potential solutions for developing automated tools based on cross-lingual approaches. Overall, our work not only unravels a key limitation in the link structure of Wikipedia and quantitatively assesses its impact, but also provides a new perspective on the challenges of maintenance associated with content creation at scale in Wikipedia.
Related papers
- An Open Multilingual System for Scoring Readability of Wikipedia [3.992677070507323]
We develop a multilingual model to score the readability of Wikipedia articles.
We create a novel multilingual dataset spanning 14 languages, by matching articles from Wikipedia to simplified Wikipedia and online childrens.
We show that our model performs well in a zero-shot scenario, yielding a ranking accuracy of more than 80% across 14 languages.
arXiv Detail & Related papers (2024-06-03T23:07:18Z) - Kuaipedia: a Large-scale Multi-modal Short-video Encyclopedia [59.47639408597319]
Kuaipedia is a large-scale multi-modal encyclopedia consisting of items, aspects, and short videos lined to them.
It was extracted from billions of videos of Kuaishou, a well-known short-video platform in China.
arXiv Detail & Related papers (2022-10-28T12:54:30Z) - Mapping Process for the Task: Wikidata Statements to Text as Wikipedia
Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z) - WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions
from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z) - Improving Wikipedia Verifiability with AI [116.69749668874493]
We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims.
Our first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the same top 10% most likely unverifiable claims.
Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia.
arXiv Detail & Related papers (2022-07-08T15:23:29Z) - Surfer100: Generating Surveys From Web Resources on Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation.
We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
arXiv Detail & Related papers (2021-12-13T02:18:01Z) - A Large Scale Study of Reader Interactions with Images on Wikipedia [2.370481325034443]
This study is the first large-scale analysis of how interactions with images happen on Wikipedia.
We quantify the overall engagement with images, finding that one in 29 results in a click on at least one image.
We observe that clicks on images occur more often in shorter articles and articles about visual arts or transports and biographies of less well-known people.
arXiv Detail & Related papers (2021-12-03T12:02:59Z) - Wiki-Reliability: A Large Scale Dataset for Content Reliability on
Wikipedia [4.148821165759295]
We build the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues.
To build this dataset, we rely on Wikipedia "templates"
We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative.
arXiv Detail & Related papers (2021-05-10T05:07:03Z) - Language-agnostic Topic Classification for Wikipedia [1.950869817974852]
We propose a language-agnostic approach based on the links in an article for classifying articles into a taxonomy of topics.
We show that it matches the performance of a language-dependent approach while being simpler and having much greater coverage.
arXiv Detail & Related papers (2021-02-26T22:17:50Z) - Multiple Texts as a Limiting Factor in Online Learning: Quantifying
(Dis-)similarities of Knowledge Networks across Languages [60.00219873112454]
We investigate the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted.
Since Wikipedia is a central part of the web-based information landscape, this indicates a language-related, linguistic bias.
The article builds a bridge between reading research, educational science, Wikipedia research and computational linguistics.
arXiv Detail & Related papers (2020-08-05T11:11:55Z) - How Inclusive Are Wikipedia's Hyperlinks in Articles Covering Polarizing
Topics? [8.035521056416242]
We focus on the influence of the interconnect topology between articles describing complementary aspects of polarizing topics.
We introduce a novel measure of exposure to diverse information to quantify users' exposure to different aspects of a topic.
We identify cases in which the network topology significantly limits the exposure of users to diverse information on the topic, encouraging users to remain in a knowledge bubble.
arXiv Detail & Related papers (2020-07-16T09:19:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.