'I Updated the <ref>': The Evolution of References in the English
Wikipedia and the Implications for Altmetrics
- URL: http://arxiv.org/abs/2010.03083v1
- Date: Tue, 6 Oct 2020 23:26:12 GMT
- Title: 'I Updated the <ref>': The Evolution of References in the English
Wikipedia and the Implications for Altmetrics
- Authors: Olga Zagovora, Roberto Ulloa, Katrin Weller, Fabian Flöck
- Abstract summary: We present a dataset of the history of all the references (more than 55 million) ever used in the English Wikipedia until June 2019.
We have applied a new method for identifying and monitoring references in Wikipedia, so that for each reference we can provide data about associated actions: creation, modifications, deletions, and reinsertions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With this work, we present a publicly available dataset of the history of all
the references (more than 55 million) ever used in the English Wikipedia until
June 2019. We have applied a new method for identifying and monitoring
references in Wikipedia, so that for each reference we can provide data about
associated actions: creation, modifications, deletions, and reinsertions. The
high accuracy of this method and the resulting dataset was confirmed via a
comprehensive crowdworker labelling campaign. We use the dataset to study the
temporal evolution of Wikipedia references as well as users' editing behaviour.
We find evidence of a mostly productive and continuous effort to improve the
quality of references: (1) there is a persistent increase of reference and
document identifiers (DOI, PubMedID, PMC, ISBN, ISSN, ArXiv ID), and (2) most
of the reference curation work is done by registered humans (not bots or
anonymous editors). We conclude that the evolution of Wikipedia references,
including the dynamics of the community processes that tend to them, should be
leveraged in the design of relevance indexes for altmetrics, and that our
dataset can be pivotal for such an effort.
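As an illustration, the kind of analysis described above could be run on the released data roughly as follows; this is a minimal sketch, and the file name and column names (revision_timestamp, action, editor_type, identifier_type) are assumptions made for illustration rather than the dataset's published schema.

# Minimal sketch of exploring the reference-history dataset. The file name
# and columns (revision_timestamp, action, editor_type, identifier_type)
# are assumptions for illustration; the released schema may differ.
import pandas as pd

refs = pd.read_csv("enwiki_reference_history.csv",
                   parse_dates=["revision_timestamp"])

# Frequency of each action type: creation, modification, deletion, reinsertion.
print(refs["action"].value_counts())

# Share of reference edits by editor type (registered human, anonymous, bot).
print(refs["editor_type"].value_counts(normalize=True))

# Yearly counts of references carrying a document identifier
# (DOI, PubMedID, PMC, ISBN, ISSN, arXiv ID) to track their growth over time.
with_id = refs.dropna(subset=["identifier_type"])
yearly = (with_id
          .groupby([with_id["revision_timestamp"].dt.year, "identifier_type"])
          .size()
          .unstack(fill_value=0))
print(yearly)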
Related papers
- HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits [92.62157408704594]
HelloFresh is based on continuous streams of real-world data generated by intrinsically motivated human labelers.
It covers recent events from X (formerly Twitter) community notes and edits of Wikipedia pages.
It mitigates the risk of test data contamination and benchmark overfitting.
arXiv Detail & Related papers (2024-06-05T16:25:57Z)
- Longitudinal Assessment of Reference Quality on Wikipedia [7.823541290904653]
This work analyzes the reliability of this global encyclopedia through the lens of its references.
We operationalize the notion of reference quality by defining reference need (RN), i.e., the percentage of sentences missing a citation, and reference risk (RR), i.e., the proportion of non-authoritative references (both written out as simple ratios in the sketch after this list).
arXiv Detail & Related papers (2023-03-09T13:04:14Z)
- Grounded Keys-to-Text Generation: Towards Factual Open-Ended Generation [92.1582872870226]
We propose a new grounded keys-to-text generation task.
The task is to generate a factual description about an entity given a set of guiding keys, and grounding passages.
Inspired by recent QA-based evaluation measures, we propose an automatic metric, MAFE, for factual correctness of generated descriptions.
arXiv Detail & Related papers (2022-12-04T23:59:41Z)
- Data-Efficient Autoregressive Document Retrieval for Fact Verification [7.935530801269922]
This paper introduces a distant-supervision method that does not require any annotation to train autoregressive retrievers.
We show that with task-specific supervised fine-tuning, autoregressive retrieval performance for two Wikipedia-based fact verification tasks can approach or even exceed full supervision.
arXiv Detail & Related papers (2022-11-17T07:27:50Z)
- Surfer100: Generating Surveys From Web Resources on Wikipedia-style [49.23675182917996]
We show that recent advances in pretrained language modeling can be combined for a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation.
We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys.
arXiv Detail & Related papers (2021-12-13T02:18:01Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [4.148821165759295]
We build the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues.
To build this dataset, we rely on Wikipedia "templates".
We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative.
arXiv Detail & Related papers (2021-05-10T05:07:03Z)
- SupMMD: A Sentence Importance Model for Extractive Summarization using Maximum Mean Discrepancy [92.5683788430012]
SupMMD is a novel technique for generic and update summarization based on maximum mean discrepancy (MMD) from kernel two-sample testing.
We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.
arXiv Detail & Related papers (2020-10-06T09:26:55Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Knowledge graph based methods for record linkage [0.0]
We propose the use of knowledge graphs to tackle the record linkage task.
The proposed method, named WERL, takes advantage of the main knowledge graph properties and learns embedding vectors to encode census information.
We have evaluated this method on benchmark data sets and we have compared it to related methods with stimulating and satisfactory results.
arXiv Detail & Related papers (2020-03-06T11:09:44Z)
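For readability, the two reference-quality measures from the Longitudinal Assessment entry above can be written as simple ratios; this sketch follows only the one-line definitions given in that summary, not necessarily the cited paper's exact operationalization.

% RN and RR as described in the summary above; the cited paper's
% exact operationalization may differ.
\[
  \mathrm{RN} = \frac{\#\,\text{sentences missing a citation}}{\#\,\text{sentences}},
  \qquad
  \mathrm{RR} = \frac{\#\,\text{non-authoritative references}}{\#\,\text{references}}
\]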
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.