Hoaxpedia: A Unified Wikipedia Hoax Articles Dataset
- URL: http://arxiv.org/abs/2405.02175v2
- Date: Wed, 15 May 2024 17:56:25 GMT
- Title: Hoaxpedia: A Unified Wikipedia Hoax Articles Dataset
- Authors: Hsuvas Borkakoty, Luis Espinosa-Anke
- Abstract summary: Hoaxes are a recognised form of disinformation created deliberately, with potentially serious implications for the credibility of Wikipedia.
We first provide a systematic analysis of the similarities and discrepancies between legitimate and hoax Wikipedia articles.
We then introduce Hoaxpedia, a collection of 311 Hoax articles alongside semantically similar real articles.
We report results of binary classification experiments on the task of predicting whether a Wikipedia article is real or a hoax, analyzing several settings as well as a range of language models.
- Score: 10.756673240445709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hoaxes are a recognised form of disinformation created deliberately, with potentially serious implications for the credibility of reference knowledge resources such as Wikipedia. What makes Wikipedia hoaxes hard to detect is that they are often written according to the official style guidelines. In this work, we first provide a systematic analysis of the similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce Hoaxpedia, a collection of 311 Hoax articles (from existing literature as well as official Wikipedia lists) alongside semantically similar real articles. We report results of binary classification experiments on the task of predicting whether a Wikipedia article is real or a hoax, and analyze several settings as well as a range of language models. Our results suggest that detecting deceitful content in Wikipedia based on content alone, despite not having been explored much in the past, is a promising direction.
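To make the "content alone" setting concrete, below is a minimal sketch of a content-only baseline for the real-vs-hoax binary classification task. The data loading, feature choices, and classifier are illustrative assumptions, not the authors' pipeline (which evaluates a range of language models).

```python
# Minimal content-only sketch of the real-vs-hoax binary classification task.
# Placeholder data: in practice, load article texts and labels from Hoaxpedia
# (the lists below are stand-ins, not the dataset's actual schema).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

train_texts = ["text of a hoax article", "text of a real article"] * 20
train_labels = [1, 0] * 20  # 1 = hoax, 0 = real
test_texts = ["text of an unseen article"]
test_labels = [1]

# Bag-of-words features over article content only, mirroring the
# "content alone" setting highlighted in the abstract.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
print(classification_report(test_labels, clf.predict(X_test), zero_division=0))
```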
Related papers
- The role of online attention in the supply of disinformation in Wikipedia [0.030458514384586396]
We measure the relationship between allocation of attention and the production of hoax articles on the English Wikipedia.
Analysis of traffic logs reveals that, compared to legitimate articles created on the same day, hoaxes tend to be more associated with traffic spikes preceding their creation.
arXiv Detail & Related papers (2023-02-16T20:44:21Z) - Kuaipedia: a Large-scale Multi-modal Short-video Encyclopedia [59.47639408597319]
Kuaipedia is a large-scale multi-modal encyclopedia consisting of items, aspects, and short videos linked to them.
It was extracted from billions of videos on Kuaishou, a well-known short-video platform in China.
arXiv Detail & Related papers (2022-10-28T12:54:30Z) - Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
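As a toy illustration of the statement-to-sentence direction, the sketch below renders Wikidata-style triples into English via per-property templates. The property IDs are real Wikidata properties, but the templating approach and helper names are invented here; the paper maps statements to existing English Wikipedia sentences rather than generating from templates.

```python
# Toy rendering of Wikidata-style triples (subject, property, value) into
# English sentences. P19 (place of birth) and P106 (occupation) are real
# Wikidata property IDs; the templates themselves are invented here.
from typing import Optional

TEMPLATES = {
    "P19": "{subject} was born in {value}.",
    "P106": "{subject} worked as a {value}.",
}

def triple_to_sentence(subject: str, prop: str, value: str) -> Optional[str]:
    template = TEMPLATES.get(prop)
    return template.format(subject=subject, value=value) if template else None

print(triple_to_sentence("Ada Lovelace", "P19", "London"))
# Ada Lovelace was born in London.
```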
arXiv Detail & Related papers (2022-10-23T08:34:33Z) - WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata, since thousands of descriptions are currently missing.
arXiv Detail & Related papers (2022-09-27T01:28:02Z) - Improving Wikipedia Verifiability with AI [116.69749668874493]
We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims.
Our system's first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the top 10% of claims most likely to be unverifiable.
Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia.
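Side itself is not sketched here, but an off-the-shelf natural language inference model gives a rough stand-in for the verification step: score whether a cited passage entails the claim, and flag low-entailment pairs. The checkpoint and threshold below are illustrative assumptions, not the paper's system.

```python
# Rough NLI-based stand-in for citation verification: flag a citation when
# the cited passage does not entail the claim. roberta-large-mnli is a public
# checkpoint; the 0.5 threshold is an arbitrary choice for illustration.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def likely_unsupported(claim: str, cited_passage: str, threshold: float = 0.5) -> bool:
    scores = nli({"text": cited_passage, "text_pair": claim}, top_k=None)
    entailment = next(s["score"] for s in scores if s["label"] == "ENTAILMENT")
    return entailment < threshold

print(likely_unsupported(
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
))  # expected: False
```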
arXiv Detail & Related papers (2022-07-08T15:23:29Z) - A Map of Science in Wikipedia [0.22843885788439797]
We map the relationship between Wikipedia articles and scientific journal articles.
Most journal articles cited from Wikipedia belong to STEM fields, in particular biology and medicine.
Wikipedia's biographies play an important role in connecting STEM fields with the humanities, especially history.
arXiv Detail & Related papers (2021-10-26T15:44:32Z) - Tracking Knowledge Propagation Across Wikipedia Languages [1.8447697408534176]
We present a dataset of inter-language knowledge propagation in Wikipedia.
The dataset covers all 309 language editions and 33M articles.
We find that the size of language editions is associated with the speed of propagation.
arXiv Detail & Related papers (2021-03-30T18:36:13Z) - Analyzing Wikidata Transclusion on English Wikipedia [1.5736899098702972]
This work presents a taxonomy of Wikidata transclusion and an analysis of Wikidata transclusion within English Wikipedia.
It finds that Wikidata transclusion that impacts the content of Wikipedia articles happens at a much lower rate (5%) than previous statistics had suggested (61%).
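For intuition, a crude version of detecting content-impacting transclusion is to scan raw wikitext for the MediaWiki parser functions that pull values from Wikidata. The binary check below is an illustrative simplification of the paper's finer-grained taxonomy.

```python
# Crude detector for Wikidata transclusion in raw wikitext. {{#property:...}}
# and {{#statements:...}} are real MediaWiki parser functions that pull values
# from Wikidata; treating any match as content-impacting is a simplification.
import re

TRANSCLUSION = re.compile(r"\{\{#(property|statements):", re.IGNORECASE)

def transcludes_wikidata(wikitext: str) -> bool:
    return bool(TRANSCLUSION.search(wikitext))

print(transcludes_wikidata("Born {{#statements:P569}} in London."))  # True
print(transcludes_wikidata("Plain prose with no transclusion."))     # False
```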
arXiv Detail & Related papers (2020-11-02T14:16:42Z) - Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks across Languages [60.00219873112454]
We investigate the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted.
Since Wikipedia is a central part of the web-based information landscape, this indicates a language-related bias.
The article builds a bridge between reading research, educational science, Wikipedia research and computational linguistics.
arXiv Detail & Related papers (2020-08-05T11:11:55Z) - Design Challenges in Low-resource Cross-lingual Entity Linking [56.18957576362098]
Cross-lingual Entity Linking (XEL) is the problem of grounding mentions of entities in a foreign language text into an English knowledge base such as Wikipedia.
This paper focuses on the key step of identifying candidate English Wikipedia titles that correspond to a given foreign language mention.
We present a simple yet effective zero-shot XEL system, QuEL, that utilizes search-engine query logs.
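The candidate-generation idea can be sketched as a dictionary mined from query logs: count how often a query string led users to a given English Wikipedia title, then rank titles per mention. The log format and raw-count scoring below are assumptions for illustration, not QuEL's exact method.

```python
# Query-log-based candidate generation in the spirit of QuEL: rank English
# Wikipedia titles by how often a query led to them. The (query, clicked
# title) log format and raw-count scoring are illustrative assumptions.
from collections import Counter, defaultdict

log = [
    ("chicago stad", "Chicago"),
    ("chicago stad", "Chicago (band)"),
    ("chicago stad", "Chicago"),
]

click_counts = defaultdict(Counter)
for query, title in log:
    click_counts[query][title] += 1

def candidate_titles(mention: str, k: int = 5) -> list:
    return [title for title, _ in click_counts[mention.lower()].most_common(k)]

print(candidate_titles("Chicago stad"))  # ['Chicago', 'Chicago (band)']
```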
arXiv Detail & Related papers (2020-05-02T04:00:26Z) - Entity Extraction from Wikipedia List Pages [2.3605348648054463]
We build a large taxonomy from categories and list pages with DBpedia as a backbone.
With distant supervision, we extract training data for the identification of new entities in list pages.
We extend DBpedia with 7.5M new type statements and 3.8M new facts of high precision.
arXiv Detail & Related papers (2020-03-11T07:48:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.