WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in
Wikipedia
- URL: http://arxiv.org/abs/2305.05928v2
- Date: Fri, 29 Dec 2023 21:24:40 GMT
- Title: WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in
Wikipedia
- Authors: Kenichiro Ando, Satoshi Sekine, Mamoru Komachi
- Abstract summary: We propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia.
Each sentence is extracted from the entire revision history of English Wikipedia.
WikiSQE has about 3.4 M sentences with 153 quality labels.
- Score: 14.325320851640084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Wikipedia can be edited by anyone and thus contains sentences of
varying quality, including poor-quality edits that are often marked up by other
editors. While editors' reviews enhance the
credibility of Wikipedia, it is hard to check all edited text. Assisting in
this process is very important, but a large and comprehensive dataset for
studying it does not currently exist. Here, we propose WikiSQE, the first
large-scale dataset for sentence quality estimation in Wikipedia. Each sentence
is extracted from the entire revision history of English Wikipedia, and the
target quality labels were carefully investigated and selected. WikiSQE has
about 3.4 M sentences with 153 quality labels. In the experiment with automatic
classification using competitive machine learning models, sentences that had
problems with citation, syntax/semantics, or propositions were found to be more
difficult to detect. In addition, by performing human annotation, we found that
the model we developed performed better than the crowdsourced workers. WikiSQE
is expected to be a valuable resource for other tasks in NLP.
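The abstract describes an automatic classification experiment over the 153 quality labels. As a rough, hypothetical illustration of that setting (not the authors' code), the sketch below fine-tunes a generic pretrained encoder on (sentence, label) pairs; the model name, file names, column names, and hyperparameters are all assumptions.

```python
# Hypothetical sketch (not the authors' code): fine-tune a pretrained encoder
# to predict one of the 153 WikiSQE quality labels per sentence. The model
# choice, file names, column names, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

NUM_LABELS = 153  # number of quality labels reported in the abstract

# Assumed CSV files with columns "sentence" and integer "label" (0..152).
data = load_dataset("csv", data_files={"train": "wikisqe_train.csv",
                                       "dev": "wikisqe_dev.csv"})

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=NUM_LABELS)

def tokenize(batch):
    # Sentences are short; the 128-token cap is an arbitrary choice here.
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="wikisqe-classifier",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["dev"],
    tokenizer=tokenizer,
)
trainer.train()
```

A classifier of this shape is only a starting point; the error analysis reported above (citation, syntax/semantics, and proposition issues being hardest) would be carried out on top of its per-label predictions.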
Related papers
- Language-Agnostic Modeling of Wikipedia Articles for Content Quality Assessment across Languages [0.19698344608599344]
We propose a novel computational framework for modeling the quality of Wikipedia articles.
Our framework is based on language-agnostic structural features extracted from the articles.
We have built datasets with the feature values and quality scores of all revisions of all articles in the existing language versions of Wikipedia.
arXiv Detail & Related papers (2024-04-15T13:07:31Z)
- Edisum: Summarizing and Explaining Wikipedia Edits at Scale [9.968020416365757]
We propose a model for recommending edit summaries generated by a language model trained to produce good edit summaries.
Our model performs on par with human editors.
More broadly, we showcase how language modeling technology can be used to support humans in maintaining one of the largest and most visible projects on the Web.
arXiv Detail & Related papers (2024-04-04T13:15:28Z)
- WikiIns: A High-Quality Dataset for Controlled Text Editing by Natural Language Instruction [56.196512595940334]
We build and release WikiIns, a high-quality controlled text editing dataset with improved informativeness.
With the high-quality annotated dataset, we propose automatic approaches to generate a large-scale "silver" training set.
arXiv Detail & Related papers (2023-10-08T04:46:39Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Rethink about the Word-level Quality Estimation for Machine Translation from Human Judgement [57.72846454929923]
We create a benchmark dataset, HJQE, where expert translators directly annotate poorly translated words.
We propose two tag correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE corpus closer to HJQE.
The results show our proposed dataset is more consistent with human judgement and also confirm the effectiveness of the proposed tag correcting strategies.
arXiv Detail & Related papers (2022-09-13T02:37:12Z)
- Improving Wikipedia Verifiability with AI [116.69749668874493]
We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims.
Our first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the same top 10% most likely unverifiable claims.
Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia.
arXiv Detail & Related papers (2022-07-08T15:23:29Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Measuring Wikipedia Article Quality in One Dimension by Extending ORES with Ordinal Regression [1.52292571922932]
Article quality ratings on English language Wikipedia have been widely used by both Wikipedia community members and academic researchers.
Measuring quality presents many methodological challenges.
The most widely used systems use labels on discrete ordinal scales when assessing quality, but such labels can be inconvenient for statistics and machine learning.
arXiv Detail & Related papers (2021-08-15T23:05:28Z)
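The entry above motivates collapsing discrete ordinal quality labels onto a single continuous scale. The following is a hedged sketch of that general idea, not the paper's ORES-based model: it uses a simple threshold-based ordinal-regression reduction, and the class names, features, and data are placeholders.

```python
# Hedged sketch (not the paper's ORES-based model): turn discrete ordinal
# quality labels (e.g., Stub < Start < C < B < GA < FA) into one continuous
# score by fitting a binary classifier per threshold "quality > k" and
# summing the predicted probabilities. Features and labels are random
# placeholders for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

CLASSES = ["Stub", "Start", "C", "B", "GA", "FA"]  # ordered worst -> best

def fit_ordinal(X: np.ndarray, y: np.ndarray):
    """Fit one binary classifier per threshold k, modeling P(quality > k)."""
    return [LogisticRegression(max_iter=1000).fit(X, (y > k).astype(int))
            for k in range(len(CLASSES) - 1)]

def score(models, X: np.ndarray) -> np.ndarray:
    """Continuous quality score = expected number of thresholds exceeded."""
    return sum(m.predict_proba(X)[:, 1] for m in models)

# Toy usage with random data:
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))           # e.g., structural article features
y = rng.integers(0, len(CLASSES), 500)  # ordinal class indices 0..5
models = fit_ordinal(X, y)
print(score(models, X[:3]))             # one score in [0, 5] per article
```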
- Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [4.148821165759295]
We build the first dataset of English Wikipedia articles annotated with a wide set of content reliability issues.
To build this dataset, we rely on Wikipedia "templates".
We select the 10 most popular reliability-related templates on Wikipedia, and propose an effective method to label almost 1M samples of Wikipedia article revisions as positive or negative.
arXiv Detail & Related papers (2021-05-10T05:07:03Z)
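The Wiki-Reliability entry above labels article revisions using reliability-related maintenance templates. Below is a hedged sketch of that general template-based weak-labeling idea, not the paper's exact pipeline; the template names, pairing rule, and data format are assumptions.

```python
# Hedged sketch (not the paper's exact pipeline): weakly label article
# revisions by the presence or removal of a reliability-related maintenance
# template in the wikitext. Template names and the pairing rule are
# illustrative assumptions.
import re
from typing import Iterable, Tuple

TEMPLATES = ["POV", "Unreferenced", "Original research"]  # assumed examples

def has_template(wikitext: str, name: str) -> bool:
    """True if the maintenance template {{name ...}} occurs in the wikitext."""
    pattern = r"\{\{\s*" + re.escape(name) + r"\b"
    return re.search(pattern, wikitext, flags=re.IGNORECASE) is not None

def label_revisions(revisions: Iterable[Tuple[str, str]], template: str):
    """Yield (rev_id, label): 1 while the template is present, and 0 for the
    first revision after it is removed (issue presumably resolved)."""
    previously_tagged = False
    for rev_id, wikitext in revisions:  # revisions in chronological order
        tagged = has_template(wikitext, template)
        if tagged:
            yield rev_id, 1             # positive: flagged as unreliable
        elif previously_tagged:
            yield rev_id, 0             # negative: template just removed
        previously_tagged = tagged

# Example usage with toy wikitext snippets:
history = [
    ("r1", "Some article text."),
    ("r2", "{{POV}} Some article text."),
    ("r3", "Some article text, now balanced."),
]
print(list(label_revisions(history, "POV")))  # [('r2', 1), ('r3', 0)]
```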