Related papers: EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles

EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles

URL: http://arxiv.org/abs/2406.12614v4
Date: Fri, 30 Aug 2024 12:40:04 GMT
Title: EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles
Authors: João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton,
Abstract summary: This work introduces EUvsDisinfo, a multilingual dataset of disinformation articles originating from pro-Kremlin outlets. It is sourced directly from the debunk articles written by experts leading the EUvsDisinfo project. Our dataset is the largest to-date resource in terms of the overall number of articles and distinct languages.
Score: 4.895830603263421
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: This work introduces EUvsDisinfo, a multilingual dataset of disinformation articles originating from pro-Kremlin outlets, along with trustworthy articles from credible / less biased sources. It is sourced directly from the debunk articles written by experts leading the EUvsDisinfo project. Our dataset is the largest to-date resource in terms of the overall number of articles and distinct languages. It also provides the largest topical and temporal coverage. Using this dataset, we investigate the dissemination of pro-Kremlin disinformation across different languages, uncovering language-specific patterns targeting certain disinformation topics. We further analyse the evolution of topic distribution over an eight-year period, noting a significant surge in disinformation content before the full-scale invasion of Ukraine in 2022. Lastly, we demonstrate the dataset's applicability in training models to effectively distinguish between disinformation and trustworthy content in multilingual settings.

Related papers

Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking [2.321323878201932]
MultiSynFact is the first large-scale multilingual fact-checking dataset containing 2.2M claim-source pairs. Our dataset generation pipeline leverages Large Language Models (LLMs), integrating external knowledge from Wikipedia. We open-source a user-friendly framework to facilitate further research in multilingual fact-checking and dataset generation.
arXiv Detail & Related papers (2025-02-21T12:38:26Z)
When a Language Question Is at Stake. A Revisited Approach to Label Sensitive Content [0.0]
Article revisits an approach of pseudo-labeling sensitive data on the example of Ukrainian tweets covering the Russian-Ukrainian war. We provide a fundamental statistical analysis of the obtained data, evaluation of models used for pseudo-labelling, and set further guidelines on how the scientists can leverage the corpus.
arXiv Detail & Related papers (2023-11-17T13:35:10Z)
Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval [62.82448161570428]
This dataset is designed to investigate fairness in a multilingual information retrieval context. It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages. It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
arXiv Detail & Related papers (2023-11-03T12:29:11Z)
Lost in Translation -- Multilingual Misinformation and its Evolution [52.07628580627591]
This paper investigates the prevalence and dynamics of multilingual misinformation through an analysis of over 250,000 unique fact-checks spanning 95 languages. We find that while the majority of misinformation claims are only fact-checked once, 11.7%, corresponding to more than 21,000 claims, are checked multiple times. Using fact-checks as a proxy for the spread of misinformation, we find 33% of repeated claims cross linguistic boundaries.
arXiv Detail & Related papers (2023-10-27T12:21:55Z)
Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study [6.011001795749255]
This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com) We describe our content acquisition methodology and perform cross-site unsupervised topic clustering on the resulting multilingual dataset. We make publicly available this new dataset of 14,053 articles, annotated with each language version, and additional metadata such as links and images.
arXiv Detail & Related papers (2023-10-21T15:00:27Z)
Multiverse: Multilingual Evidence for Fake News Detection [71.51905606492376]
Multiverse is a new feature based on multilingual evidence that can be used for fake news detection. The hypothesis of the usage of cross-lingual evidence as a feature for fake news detection is confirmed.
arXiv Detail & Related papers (2022-11-25T18:24:17Z)
Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language. The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German. We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
CoVoST 2 and Massively Multilingual Speech-to-Text Translation [24.904548615918355]
CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. This represents the largest open dataset available to date from total volume and language coverage perspective.
arXiv Detail & Related papers (2020-07-20T17:53:35Z)
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages. We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English. It diversified with over 11,000 speakers and over 60 accents. CoVoST is released under CC0 license and free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.