Web2Wiki: Characterizing Wikipedia Linking Across the Web
- URL: http://arxiv.org/abs/2505.15837v1
- Date: Sat, 17 May 2025 00:52:24 GMT
- Title: Web2Wiki: Characterizing Wikipedia Linking Across the Web
- Authors: Veniamin Veselovsky, Tiziano Piccardi, Ashton Anderson, Robert West, Akhil Arora
- Abstract summary: We identify over 90 million Wikipedia links spanning 1.68% of Web domains. Wikipedia is most frequently cited by news and science websites for informational purposes. Most links serve as explanatory references rather than as evidence or attribution.
- Score: 19.00204665059246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Wikipedia is one of the most visited websites globally, yet its role beyond its own platform remains largely unexplored. In this paper, we present the first large-scale analysis of how Wikipedia is referenced across the Web. Using a dataset from Common Crawl, we identify over 90 million Wikipedia links spanning 1.68% of Web domains and examine their distribution, context, and function. Our analysis of English Wikipedia reveals three key findings: (1) Wikipedia is most frequently cited by news and science websites for informational purposes, while commercial websites reference it less often. (2) The majority of Wikipedia links appear within the main content rather than in boilerplate or user-generated sections, highlighting their role in structured knowledge presentation. (3) Most links (95%) serve as explanatory references rather than as evidence or attribution, reinforcing Wikipedia's function as a background knowledge provider. While this study focuses on English Wikipedia, our publicly released Web2Wiki dataset includes links from multiple language editions, supporting future research on Wikipedia's global influence on the Web.
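The abstract does not describe how the links were extracted, so the following is only a minimal sketch of one way to find Wikipedia links in crawled HTML and to estimate the share of domains that link to Wikipedia. It is not the authors' pipeline. The names `WikipediaLinkExtractor`, `extract_wikipedia_links`, and `share_of_linking_domains` are hypothetical, the input is assumed to be `(page_url, html)` pairs already read from a crawl, and the language edition is assumed to be the first label of a `*.wikipedia.org` host.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class WikipediaLinkExtractor(HTMLParser):
    """Collects <a href> targets whose host is a Wikipedia language edition."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (language_code, url) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        try:
            host = (urlparse(href).hostname or "").lower()
        except ValueError:
            return  # malformed URL, skip it
        # Assumption: hosts look like "en.wikipedia.org" or "en.m.wikipedia.org".
        if host.endswith(".wikipedia.org"):
            lang = host.split(".")[0]
            if lang != "www":
                self.links.append((lang, href))


def extract_wikipedia_links(html):
    """Return all (language_code, url) Wikipedia links found in one HTML page."""
    parser = WikipediaLinkExtractor()
    parser.feed(html)
    return parser.links


def share_of_linking_domains(pages):
    """Fraction of distinct page hosts with at least one Wikipedia link.

    pages: iterable of (page_url, html) pairs (hypothetical input format).
    """
    all_domains, linking_domains = set(), set()
    for page_url, html in pages:
        domain = (urlparse(page_url).hostname or "").lower()
        all_domains.add(domain)
        if extract_wikipedia_links(html):
            linking_domains.add(domain)
    return len(linking_domains) / len(all_domains) if all_domains else 0.0
```

A real analysis would also need to exclude Wikipedia's own pages, group hosts by registered domain, and separate main content from boilerplate, none of which this sketch attempts.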
Related papers
- WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset [48.00110675968677]
We introduce the Wikipedia Webpage 2M (WikiWeb2M) suite; the first to retain the full set of images, text, and structure data available in a page.
WikiWeb2M can be used for tasks like page description generation, section summarization, and contextual image captioning.
arXiv Detail & Related papers (2023-05-09T13:20:59Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Improving Wikipedia Verifiability with AI [116.69749668874493]
We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims.
For the top 10% of claims most likely to be unverifiable, our first citation recommendation collects over 60% more preferences than the existing Wikipedia citation.
Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia.
arXiv Detail & Related papers (2022-07-08T15:23:29Z)
- Surfer100: Generating Surveys From Web Resources on Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation.
We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
arXiv Detail & Related papers (2021-12-13T02:18:01Z)
- A Large Scale Study of Reader Interactions with Images on Wikipedia [2.370481325034443]
This study is the first large-scale analysis of how interactions with images happen on Wikipedia.
We quantify the overall engagement with images, finding that one in 29 page views results in a click on at least one image.
We observe that clicks on images occur more often in shorter articles, in articles about visual arts or transport, and in biographies of less well-known people.
arXiv Detail & Related papers (2021-12-03T12:02:59Z)
- A Map of Science in Wikipedia [0.22843885788439797]
We map the relationship between Wikipedia articles and scientific journal articles.
Most journal articles cited from Wikipedia belong to STEM fields, in particular biology and medicine.
Wikipedia's biographies play an important role in connecting STEM fields with the humanities, especially history.
arXiv Detail & Related papers (2021-10-26T15:44:32Z)
- Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks across Languages [60.00219873112454]
We investigate the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted.
Since Wikipedia is a central part of the web-based information landscape, this indicates a language-related, linguistic bias.
The article builds a bridge between reading research, educational science, Wikipedia research and computational linguistics.
arXiv Detail & Related papers (2020-08-05T11:11:55Z)
- A Deeper Investigation of the Importance of Wikipedia Links to the Success of Search Engines [7.433327915285967]
We report the results of an investigation into the incidence of Wikipedia links in search engine results pages (SERPs).
We find that Wikipedia links are extremely common in important search contexts, appearing in 67-84% of all SERPs for common and trending queries, but less often for medical queries.
Our findings reinforce the complementary notions that (1) Wikipedia content and research have a major impact outside of the Wikipedia domain and (2) powerful technologies like search engines are highly reliant on free content created by volunteers.
arXiv Detail & Related papers (2020-04-21T19:58:28Z)
- Entity Extraction from Wikipedia List Pages [2.3605348648054463]
We build a large taxonomy from categories and list pages with DBpedia as a backbone.
With distant supervision, we extract training data for the identification of new entities in list pages.
We extend DBpedia with 7.5M new type statements and 3.8M new facts of high precision.
arXiv Detail & Related papers (2020-03-11T07:48:46Z)
- Quantifying Engagement with Citations on Wikipedia [13.703047949952852]
One in 300 page views results in a reference click.
Clicks occur more frequently on shorter pages and on pages of lower quality.
Recent content, open access sources and references about life events are particularly popular.
arXiv Detail & Related papers (2020-01-23T15:52:36Z)