Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study
- URL: http://arxiv.org/abs/2310.14032v1
- Date: Sat, 21 Oct 2023 15:00:27 GMT
- Title: Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study
- Authors: Freddy Heppell, Kalina Bontcheva, Carolina Scarton
- Abstract summary: This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com).
We describe our content acquisition methodology and perform cross-site unsupervised topic clustering on the resulting multilingual dataset.
We make publicly available this new dataset of 14,053 articles, annotated with each language version, and additional metadata such as links and images.
- Score: 6.011001795749255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper analyses two hitherto unstudied sites sharing state-backed
disinformation, Reliable Recent News (rrn.world) and WarOnFakes
(waronfakes.com), which publish content in Arabic, Chinese, English, French,
German, and Spanish. We describe our content acquisition methodology and
perform cross-site unsupervised topic clustering on the resulting multilingual
dataset. We also perform linguistic and temporal analysis of the web page
translations and topics over time, and investigate articles with false
publication dates. We make publicly available this new dataset of 14,053
articles, annotated with each language version, and additional metadata such as
links and images. The main contribution of this paper for the NLP community is
in the novel dataset which enables studies of disinformation networks, and the
training of NLP tools for disinformation detection.
Related papers
- Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page [0.0]
I introduce a new large-scale dataset of historical wire articles from U.S. Southern newspapers, spanning 1960-1975.
Unlike prior work focusing on front-page content, this dataset captures articles across the entire newspaper, offering broader insight into mid-century Southern coverage.
arXiv Detail & Related papers (2025-02-17T14:57:47Z)
- Multilingual Attribute Extraction from News Web Pages [44.99833362998488]
This paper addresses the challenge of automatically extracting attributes from news article web pages across multiple languages.
We prepared a multilingual dataset comprising 3,172 marked-up news web pages across six languages (English, German, Russian, Chinese, Korean, and Arabic).
We fine-tuned the pre-trained state-of-the-art model, MarkupLM, to extract news attributes from these pages and evaluated the impact of translating pages into English on extraction quality.
arXiv Detail & Related papers (2025-02-04T09:43:40Z)
- POLygraph: Polish Fake News Dataset [0.37698262166557467]
This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish.
The dataset is composed of two parts: the "fake-or-not" dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the "fake-they-say" dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them.
The project also developed a software tool that uses advanced machine learning techniques to analyze the data and determine content authenticity.
arXiv Detail & Related papers (2024-07-01T15:45:21Z)
- EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles [4.895830603263421]
This work introduces EUvsDisinfo, a multilingual dataset of disinformation articles originating from pro-Kremlin outlets.
It is sourced directly from the debunk articles written by experts leading the EUvsDisinfo project.
Our dataset is the largest to-date resource in terms of the overall number of articles and distinct languages.
arXiv Detail & Related papers (2024-06-18T13:43:22Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs [55.80189506270598]
X-PARADE is the first cross-lingual dataset of paragraph-level information divergences.
Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language.
Aligned paragraphs are sourced from Wikipedia pages in different languages.
arXiv Detail & Related papers (2023-09-16T04:34:55Z)
- MegaWika: Millions of reports and their sources across 50 diverse languages [74.3909725023673]
MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials.
We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications.
MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
arXiv Detail & Related papers (2023-07-13T20:04:02Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- A High-Quality Multilingual Dataset for Structured Documentation Translation [101.41835967142521]
This paper presents a high-quality multilingual dataset for the documentation domain.
We collect XML-structured parallel text segments from the online documentation for an enterprise software platform.
arXiv Detail & Related papers (2020-06-24T02:08:44Z)