Related papers: Wikipedia is Not a Dictionary, Delete! Text Classification as a Proxy for Analysing Wiki Deletion Discussions

Wikipedia is Not a Dictionary, Delete! Text Classification as a Proxy for Analysing Wiki Deletion Discussions

URL: http://arxiv.org/abs/2503.10294v1
Date: Thu, 13 Mar 2025 12:07:35 GMT
Title: Wikipedia is Not a Dictionary, Delete! Text Classification as a Proxy for Analysing Wiki Deletion Discussions
Authors: Hsuvas Borkakoty, Luis Espinosa-Anke,
Abstract summary: We construct a database of discussions happening around articles marked for deletion in several Wikis.<n>We then use to evaluate a range of LMs on different tasks.<n>Our results reveal, among others, that discussions leading to deletion are easier to predict.
Score: 10.756673240445709
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automated content moderation for collaborative knowledge hubs like Wikipedia or Wikidata is an important yet challenging task due to multiple factors. In this paper, we construct a database of discussions happening around articles marked for deletion in several Wikis and in three languages, which we then use to evaluate a range of LMs on different tasks (from predicting the outcome of the discussion to identifying the implicit policy an individual comment might be pointing to). Our results reveal, among others, that discussions leading to deletion are easier to predict, and that, surprisingly, self-produced tags (keep, delete or redirect) don't always help guiding the classifiers, presumably because of users' hesitation or deliberation within comments.

Related papers

WiDe-analysis: Enabling One-click Content Moderation Analysis on Wikipedia's Articles for Deletion [10.756673240445709]
We introduce a suite of experiments on Wikipedia deletion discussions and wide-analyis (Wikipedia Deletion Analysis), a Python package aimed at providing one click analysis to content moderation discussions. We release all assets associated with wide-analysis, including data, models and the Python package, and a HuggingFace space with the goal to accelerate research on automating content moderation in Wikipedia and beyond.
arXiv Detail & Related papers (2024-08-10T23:43:11Z)
Evaluating D-MERIT of Partial-annotation on Information Retrieval [77.44452769932676]
Retrieval models are often evaluated on partially-annotated datasets. We show that using partially-annotated datasets in evaluation can paint a distorted picture.
arXiv Detail & Related papers (2024-06-23T08:24:08Z)
Why Should This Article Be Deleted? Transparent Stance Detection in Multilingual Wikipedia Editor Discussions [47.944081120226905]
We construct a novel dataset of Wikipedia editor discussions along with their reasoning in three languages. The dataset contains the stances of the editors (keep, delete, merge, comment), along with the stated reason, and a content moderation policy, for each edit decision. We demonstrate that stance and corresponding reason (policy) can be predicted jointly with a high degree of accuracy, adding transparency to the decision-making process.
arXiv Detail & Related papers (2023-10-09T15:11:02Z)
Semantic Parsing for Conversational Question Answering over Knowledge Graphs [63.939700311269156]
We develop a dataset where user questions are annotated with Sparql parses and system answers correspond to execution results thereof. We present two different semantic parsing approaches and highlight the challenges of the task. Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z)
Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level. The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia. We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles. The dataset consists of over 80k English samples on 6987 topics. Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
Improving Candidate Retrieval with Entity Profile Generation for Wikidata Entity Linking [76.00737707718795]
We propose a novel candidate retrieval paradigm based on entity profiling. We use the profile to query the indexed search engine to retrieve candidate entities. Our approach complements the traditional approach of using a Wikipedia anchor-text dictionary.
arXiv Detail & Related papers (2022-02-27T17:38:53Z)
Language-agnostic Topic Classification for Wikipedia [1.950869817974852]
We propose a language-agnostic approach based on the links in an article for classifying articles into a taxonomy of topics. We show that it matches the performance of a language-dependent approach while being simpler and having much greater coverage.
arXiv Detail & Related papers (2021-02-26T22:17:50Z)
WAC: A Corpus of Wikipedia Conversations for Online Abuse Detection [0.0]
We propose an original framework, based on the Wikipedia Comment corpus, with comment-level annotations of different types. This large corpus of more than 380k annotated messages opens perspectives for online abuse detection and especially for context-based approaches. We also propose, in addition to this corpus, a complete benchmarking platform to stimulate and fairly compare scientific works around the problem of content abuse detection.
arXiv Detail & Related papers (2020-03-13T10:26:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.