Measuring Wikipedia Article Quality in One Dimension by Extending ORES
with Ordinal Regression
- URL: http://arxiv.org/abs/2108.10684v1
- Date: Sun, 15 Aug 2021 23:05:28 GMT
- Title: Measuring Wikipedia Article Quality in One Dimension by Extending ORES
with Ordinal Regression
- Authors: Nathan TeBlunthuis
- Abstract summary: Article quality ratings on English language Wikipedia have been widely used by both Wikipedia community members and academic researchers.
However, measuring quality presents many methodological challenges.
The most widely used systems use labels on discrete ordinal scales when assessing quality, but such labels can be inconvenient for statistics and machine learning.
- Score: 1.52292571922932
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Organizing complex peer production projects and advancing scientific
knowledge of open collaboration each depend on the ability to measure quality.
Article quality ratings on English language Wikipedia have been widely used by
both Wikipedia community members and academic researchers for purposes like
tracking knowledge gaps and studying how political polarization shapes
collaboration. Even so, measuring quality presents many methodological
challenges. The most widely used systems use labels on discrete ordinal scales
when assessing quality, but such labels can be inconvenient for statistics and
machine learning. Prior work handles this by assuming that different levels of
quality are "evenly spaced" from one another. This assumption runs counter to
intuitions about the relative degrees of effort needed to raise Wikipedia
encyclopedia articles to different quality levels. Furthermore, models from
prior work are fit to datasets that oversample high-quality articles. This
limits their accuracy for representative samples of articles or revisions. I
describe a technique extending the Wikimedia Foundation's ORES article quality
model to address these limitations. My method uses weighted ordinal regression
models to construct one-dimensional continuous measures of quality. While
scores from my technique and from prior approaches are correlated, my approach
improves accuracy for research datasets and provides evidence that the "evenly
spaced" assumption is unfounded in practice on English Wikipedia. I conclude
with recommendations for using quality scores in future research and include
the full code, data, and models.
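To make the core idea concrete, the following is a minimal sketch (not the paper's released code) of a weighted proportional-odds (ordinal logistic) regression that maps article features to a one-dimensional continuous quality score. The feature matrix, labels, and sampling weights below are hypothetical stand-ins for ORES article-quality features, the Stub-to-FA label scale, and the correction for oversampled high-quality articles; the point is that the fitted cutpoints need not come out evenly spaced.

```python
# Sketch: weighted ordinal (proportional-odds) regression producing a
# continuous latent quality score. Illustrative only; feature names, data,
# and weights are hypothetical, not the paper's actual pipeline.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

def fit_weighted_ordinal_logit(X, y, sample_weight, n_classes):
    """Fit P(y <= k | x) = sigmoid(theta_k - x @ beta) by weighted MLE.

    X: (n, p) features; y: integer labels 0..n_classes-1;
    sample_weight: per-article weights that undo oversampling of high classes.
    Returns (beta, thresholds); thresholds are free to be unevenly spaced.
    """
    n, p = X.shape

    def unpack(params):
        beta = params[:p]
        # First cutpoint is free; later cutpoints add positive increments,
        # which enforces theta_0 < theta_1 < ... without explicit constraints.
        theta = np.cumsum(np.concatenate(([params[p]], np.exp(params[p + 1:]))))
        return beta, theta

    def neg_log_likelihood(params):
        beta, theta = unpack(params)
        eta = X @ beta
        # Cumulative probabilities P(y <= k), padded with 0 and 1 at the ends.
        cum = expit(theta[None, :] - eta[:, None])
        cum = np.hstack([np.zeros((n, 1)), cum, np.ones((n, 1))])
        prob = cum[np.arange(n), y + 1] - cum[np.arange(n), y]
        return -np.sum(sample_weight * np.log(np.clip(prob, 1e-12, None)))

    x0 = np.zeros(p + n_classes - 1)
    result = minimize(neg_log_likelihood, x0, method="L-BFGS-B")
    return unpack(result.x)

# Toy usage with synthetic data: 6 ordinal levels (Stub..FA) and 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
latent = X @ np.array([1.0, 0.5, -0.3]) + rng.logistic(size=500)
y = np.digitize(latent, bins=[-2.0, -0.5, 0.5, 2.0, 3.5])  # unevenly spaced cuts
weights = np.ones(500)  # in practice: inverse sampling probabilities

beta, thresholds = fit_weighted_ordinal_logit(X, y, weights, n_classes=6)
scores = X @ beta                       # one-dimensional continuous quality scores
print("fitted cutpoints:", thresholds)  # gap sizes test the "evenly spaced" assumption
```

The latent score X @ beta plays the role of the continuous quality measure, and the gaps between fitted cutpoints show how far apart adjacent quality levels actually sit, instead of assuming unit spacing between ordinal labels.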
Related papers
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences; the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- A Novel Metric for Measuring Data Quality in Classification Applications (extended version) [0.0]
We introduce and explain a novel metric to measure data quality.
This metric is based on the correlated evolution between the classification performance and the deterioration of data.
We provide an interpretation of each criterion and examples of assessment levels.
arXiv Detail & Related papers (2023-12-13T11:20:09Z)
- Automatic Quality Assessment of Wikipedia Articles -- A Systematic Literature Review [0.8158530638728501]
We review existing methods for automatically measuring the quality of Wikipedia articles.
We identify and compare machine learning algorithms, article features, quality metrics, and the datasets used.
We hope that our analysis helps future researchers change that reality.
arXiv Detail & Related papers (2023-10-03T17:45:39Z)
- Data Quality in Imitation Learning [15.939363481618738]
In offline learning for robotics, we simply lack internet-scale data, and so high-quality datasets are a necessity.
This is especially true in imitation learning (IL), a sample efficient paradigm for robot learning using expert demonstrations.
In this work, we take the first step toward formalizing data quality for imitation learning through the lens of distribution shift.
arXiv Detail & Related papers (2023-06-04T18:48:32Z)
- WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia [14.325320851640084]
We propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia.
Each sentence is extracted from the entire revision history of English Wikipedia.
WikiSQE has about 3.4 M sentences with 153 quality labels.
arXiv Detail & Related papers (2023-05-10T06:45:13Z)
- Fairness meets Cross-Domain Learning: a new perspective on Models and Metrics [80.07271410743806]
We study the relationship between cross-domain learning (CD) and model fairness.
We introduce a benchmark on face and medical images spanning several demographic groups as well as classification and localization tasks.
Our study covers 14 CD approaches alongside three state-of-the-art fairness algorithms and shows how the former can outperform the latter.
arXiv Detail & Related papers (2023-03-25T09:34:05Z)
- A Study on the Evaluation of Generative Models [19.18642459565609]
Implicit generative models, which do not return likelihood values, have become prevalent in recent years.
In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset.
Our study shows that while FID and IS do correlate to several f-divergences, their ranking of close models can vary considerably.
arXiv Detail & Related papers (2022-06-22T09:27:31Z)
- DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation [61.99379022383108]
We propose new deep learning models to solve the bug triage problem.
The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network.
To improve the quality of ranking, we propose using additional information from version control system annotations.
arXiv Detail & Related papers (2022-01-14T00:16:57Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Are Missing Links Predictable? An Inferential Benchmark for Knowledge Graph Completion [79.07695173192472]
InferWiki improves upon existing benchmarks in inferential ability, assumptions, and patterns.
Each testing sample is predictable with supportive data in the training set.
In experiments, we curate two settings of InferWiki varying in sizes and structures, and apply the construction process on CoDEx as comparative datasets.
arXiv Detail & Related papers (2021-08-03T09:51:15Z)