Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study
- URL: http://arxiv.org/abs/2310.14032v1
- Date: Sat, 21 Oct 2023 15:00:27 GMT
- Title: Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study
- Authors: Freddy Heppell, Kalina Bontcheva, Carolina Scarton
- Abstract summary: This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com).
We describe our content acquisition methodology and perform cross-site unsupervised topic clustering on the resulting multilingual dataset.
We make publicly available this new dataset of 14,053 articles, annotated with each language version, and additional metadata such as links and images.
- Score: 6.011001795749255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper analyses two hitherto unstudied sites sharing state-backed
disinformation, Reliable Recent News (rrn.world) and WarOnFakes
(waronfakes.com), which publish content in Arabic, Chinese, English, French,
German, and Spanish. We describe our content acquisition methodology and
perform cross-site unsupervised topic clustering on the resulting multilingual
dataset. We also perform linguistic and temporal analysis of the web page
translations and topics over time, and investigate articles with false
publication dates. We make publicly available this new dataset of 14,053
articles, annotated with each language version, and additional metadata such as
links and images. The main contribution of this paper for the NLP community is
the novel dataset, which enables studies of disinformation networks and the
training of NLP tools for disinformation detection.
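The cross-site unsupervised topic clustering described above can be illustrated with a toy sketch. Note that this is an assumption-laden, stdlib-only approximation (bag-of-words vectors, cosine similarity, greedy single-link grouping, a hypothetical `threshold` parameter), not the authors' actual pipeline, which would more plausibly rely on multilingual document embeddings:

```python
# Toy sketch of unsupervised topic clustering over a small article set.
# Bag-of-words + cosine similarity + greedy single-link grouping; the
# real paper's method (models, algorithm, parameters) is not specified
# here, so everything below is an illustrative assumption.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def cluster(texts, threshold=0.2):
    # Assign each text to the cluster of the first earlier text it is
    # similar enough to; otherwise open a new cluster.
    vecs = [Counter(t.lower().split()) for t in texts]
    labels = [-1] * len(texts)
    next_label = 0
    for i, v in enumerate(vecs):
        for j in range(i):
            if cosine(v, vecs[j]) >= threshold:
                labels[i] = labels[j]
                break
        if labels[i] == -1:
            labels[i] = next_label
            next_label += 1
    return labels

articles = [
    "sanctions raise energy prices in europe",
    "energy prices rise as sanctions bite",
    "vaccine rollout begins across several countries",
    "vaccine distribution expands to more countries",
]
print(cluster(articles))  # two topical groups emerge
```

On a real multilingual corpus such as the 14,053-article dataset, surface word overlap would not work across languages; a cross-lingual embedding step would be needed before any clustering.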
Related papers
- POLygraph: Polish Fake News Dataset [0.37698262166557467]
This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish.
The dataset is composed of two parts: the "fake-or-not" dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the "fake-they-say" dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them.
The project also developed a software tool that uses advanced machine learning techniques to analyze the data and determine content authenticity.
arXiv Detail & Related papers (2024-07-01T15:45:21Z)
- EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles [4.895830603263421]
This work introduces EUvsDisinfo, a multilingual dataset of disinformation articles originating from pro-Kremlin outlets.
It is sourced directly from the debunk articles written by experts leading the EUvsDisinfo project.
Our dataset is the largest to-date resource in terms of the overall number of articles and distinct languages.
arXiv Detail & Related papers (2024-06-18T13:43:22Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs [55.80189506270598]
X-PARADE is the first cross-lingual dataset of paragraph-level information divergences.
Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language.
Aligned paragraphs are sourced from Wikipedia pages in different languages.
arXiv Detail & Related papers (2023-09-16T04:34:55Z)
- MegaWika: Millions of reports and their sources across 50 diverse languages [74.3909725023673]
MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials.
We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications.
MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
arXiv Detail & Related papers (2023-07-13T20:04:02Z)
- Identifying Informational Sources in News Articles [109.70475599552523]
We build the largest and widest-ranging annotated dataset of informational sources used in news writing.
We introduce a novel task, source prediction, to study the compositionality of sources in news articles.
arXiv Detail & Related papers (2023-05-24T08:56:35Z)
- Evaluating BERT and ParsBERT for Analyzing Persian Advertisement Data [0.0]
The paper uses the example of Divar, an online marketplace for buying and selling products and services in Iran.
It presents a competition to predict the percentage of a car sales ad that would be published on the Divar website.
Since the dataset provides a rich source of Persian text data, the authors use the Hazm library, a Python library designed for processing Persian text, and two state-of-the-art language models, mBERT and ParsBERT, to analyze it.
arXiv Detail & Related papers (2023-05-03T20:50:05Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- A High-Quality Multilingual Dataset for Structured Documentation Translation [101.41835967142521]
This paper presents a high-quality multilingual dataset for the documentation domain.
We collect XML-structured parallel text segments from the online documentation for an enterprise software platform.
arXiv Detail & Related papers (2020-06-24T02:08:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.