POLygraph: Polish Fake News Dataset
- URL: http://arxiv.org/abs/2407.01393v1
- Date: Mon, 1 Jul 2024 15:45:21 GMT
- Title: POLygraph: Polish Fake News Dataset
- Authors: Daniel Dzienisiewicz, Filip Graliński, Piotr Jabłoński, Marek Kubis, Paweł Skórzewski, Piotr Wierzchoń
- Abstract summary: This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish.
The dataset is composed of two parts: the "fake-or-not" dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the "fake-they-say" dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them.
The project also developed a software tool that uses advanced machine learning techniques to analyze the data and determine content authenticity.
- Score: 0.37698262166557467
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish. The dataset, created by an interdisciplinary team, is composed of two parts: the "fake-or-not" dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the "fake-they-say" dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them. Unlike existing datasets, POLygraph encompasses a variety of approaches from source literature, providing a comprehensive resource for fake news detection. The data was collected through manual annotation by expert and non-expert annotators. The project also developed a software tool that uses advanced machine learning techniques to analyze the data and determine content authenticity. The tool and dataset are expected to benefit various entities, from public sector institutions to publishers and fact-checking organizations. Further dataset exploration will foster fake news detection and potentially stimulate the implementation of similar models in other languages. The paper focuses on the creation and composition of the dataset, so it does not include a detailed evaluation of the software tool for content authenticity analysis, which is planned at a later stage of the project.
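The two-part structure described in the abstract maps naturally onto a simple tabular representation. The sketch below is a minimal, hypothetical illustration of how the "fake-or-not" and "fake-they-say" parts could be modelled in Python; the field names (`article_url`, `label`, `tweets`) and label values are assumptions for illustration and are not taken from the official release.
```python
# Minimal sketch of how the two POLygraph sub-datasets could be represented.
# Field names and label values are hypothetical; consult the official release
# for the actual schema and file format.
from dataclasses import dataclass
from typing import List


@dataclass
class FakeOrNotRecord:
    """One of the 11,360 article-label pairs in the 'fake-or-not' part."""
    article_url: str   # articles are identified by their URLs
    label: str         # e.g. "fake" or "not-fake" (assumed label values)


@dataclass
class FakeTheySayRecord:
    """One of the 5,082 articles with commenting tweets in the 'fake-they-say' part."""
    article_url: str
    tweets: List[str]  # texts of tweets commenting on the article


# Example usage with made-up values:
example = FakeTheySayRecord(
    article_url="https://example.com/news/123",
    tweets=["This looks fabricated.", "Source checks out."],
)
print(example.article_url, len(example.tweets))
```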
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - Author Unknown: Evaluating Performance of Author Extraction Libraries on Global Online News Articles [41.97931444618385]
We present a manually coded cross-lingual dataset of authors of online news articles.
We use it to evaluate the performance of five existing software packages and one customized model.
Go-readability and Trafilatura are the most consistent solutions for author extraction, but we find all packages produce highly variable results across languages.
arXiv Detail & Related papers (2024-10-13T20:19:15Z) - Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z) - FineFake: A Knowledge-Enriched Dataset for Fine-Grained Multi-Domain Fake News Detection [54.37159298632628]
FineFake is a multi-domain knowledge-enhanced benchmark for fake news detection.
FineFake encompasses 16,909 data samples spanning six semantic topics and eight platforms.
The entire FineFake project is publicly accessible as an open-source repository.
arXiv Detail & Related papers (2024-03-30T14:39:09Z) - FaKnow: A Unified Library for Fake News Detection [11.119667583594483]
FaKnow is a unified and comprehensive fake news detection algorithm library.
It covers the full spectrum of the model training and evaluation process.
It furnishes a series of auxiliary functionalities and tools, including visualization and logging.
arXiv Detail & Related papers (2024-01-27T13:29:17Z) - Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study [6.011001795749255]
This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com).
We describe our content acquisition methodology and perform cross-site unsupervised topic clustering on the resulting multilingual dataset.
We make publicly available this new dataset of 14,053 articles, annotated with each language version, and additional metadata such as links and images.
arXiv Detail & Related papers (2023-10-21T15:00:27Z) - infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z) - Multiverse: Multilingual Evidence for Fake News Detection [71.51905606492376]
Multiverse is a new feature based on multilingual evidence that can be used for fake news detection.
The hypothesis that cross-lingual evidence can serve as a feature for fake news detection is confirmed.
arXiv Detail & Related papers (2022-11-25T18:24:17Z) - Towards A Reliable Ground-Truth For Biased Language Detection [3.2202224129197745]
Existing methods to detect bias mostly rely on annotated data to train machine learning models.
We evaluate data collection options and compare labels obtained from two popular crowdsourcing platforms.
We conclude that detailed annotator training increases data quality, improving the performance of existing bias detection systems.
arXiv Detail & Related papers (2021-12-14T14:13:05Z) - A High-Quality Multilingual Dataset for Structured Documentation Translation [101.41835967142521]
This paper presents a high-quality multilingual dataset for the documentation domain.
We collect XML-structured parallel text segments from the online documentation for an enterprise software platform.
arXiv Detail & Related papers (2020-06-24T02:08:44Z) - BanFakeNews: A Dataset for Detecting Fake News in Bangla [1.4170999534105675]
We propose an annotated dataset of 50K news articles that can be used for building automated fake news detection systems.
We develop a benchmark system with state-of-the-art NLP techniques to identify Bangla fake news (a generic baseline of this kind is sketched after the list below).
arXiv Detail & Related papers (2020-04-19T07:42:22Z)
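For readers who want a feel for the kind of benchmark system mentioned in several entries above (e.g. BanFakeNews), the following is a minimal, generic sketch of a fake-news classification baseline using TF-IDF features and logistic regression. It is not the pipeline from any of the papers listed; the CSV path and the "text"/"label" column names are assumptions for illustration.
```python
# Generic fake-news classification baseline: TF-IDF + logistic regression.
# Illustrative sketch only, not the benchmark from any paper above.
# The CSV path and the "text"/"label" column names are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical file holding article texts and fake/not-fake labels.
df = pd.read_csv("news_dataset.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

model = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),  # word + bigram features
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```
A pipeline like this is a common starting point before moving to the transformer-based or knowledge-enhanced approaches that the datasets above are designed to support.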
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences.