Related papers: Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study

URL: http://arxiv.org/abs/2310.14032v1
Date: Sat, 21 Oct 2023 15:00:27 GMT
Title: Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study
Authors: Freddy Heppell, Kalina Bontcheva, Carolina Scarton
Abstract summary: This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com). We describe our content acquisition methodology and perform cross-site unsupervised topic clustering on the resulting multilingual dataset. We make publicly available this new dataset of 14,053 articles, annotated with each language version, and additional metadata such as links and images.
Score: 6.011001795749255
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish. We describe our content acquisition methodology and perform cross-site unsupervised topic clustering on the resulting multilingual dataset. We also perform linguistic and temporal analysis of the web page translations and topics over time, and investigate articles with false publication dates. We make publicly available this new dataset of 14,053 articles, annotated with each language version, and additional metadata such as links and images. The main contribution of this paper for the NLP community is in the novel dataset which enables studies of disinformation networks, and the training of NLP tools for disinformation detection.

Related papers

A Python Tool for Reconstructing Full News Text from GDELT [0.0]
This paper presents a novel approach to obtaining full-text newspaper articles at near-zero cost. We focus on the GDELT Web News NGrams 3.0 dataset, which provides high-frequency updates of n-grams extracted from global online news sources. We provide Python code to reconstruct full-text articles from these n-grams by identifying overlapping textual fragments and intelligently merging them.
arXiv Detail & Related papers (2025-04-22T17:40:42Z)
Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page [0.0]
I introduce a new large-scale dataset of historical wire articles from U.S. Southern newspapers, spanning 1960-1975. Unlike prior work focusing on front-page content, this dataset captures articles across the entire newspaper, offering broader insight into mid-century Southern coverage.
arXiv Detail & Related papers (2025-02-17T14:57:47Z)
Multilingual Attribute Extraction from News Web Pages [44.99833362998488]
This paper addresses the challenge of automatically extracting attributes from news article web pages across multiple languages. We prepared a multilingual dataset comprising 3,172 marked-up news web pages across six languages (English, German, Russian, Chinese, Korean, and Arabic). We fine-tuned the pre-trained state-of-the-art model, MarkupLM, to extract news attributes from these pages and evaluated the impact of translating pages into English on extraction quality.
arXiv Detail & Related papers (2025-02-04T09:43:40Z)
POLygraph: Polish Fake News Dataset [0.37698262166557467]
This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish. The dataset is composed of two parts: the "fake-or-not" dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the "fake-they-say" dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them. The project also developed a software tool that uses advanced machine learning techniques to analyze the data and determine content authenticity.
arXiv Detail & Related papers (2024-07-01T15:45:21Z)
EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles [4.895830603263421]
This work introduces EUvsDisinfo, a multilingual dataset of disinformation articles originating from pro-Kremlin outlets. It is sourced directly from the debunk articles written by experts leading the EUvsDisinfo project. Our dataset is the largest resource to date in terms of the overall number of articles and distinct languages.
arXiv Detail & Related papers (2024-06-18T13:43:22Z)
Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models. We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs [55.80189506270598]
X-PARADE is the first cross-lingual dataset of paragraph-level information divergences. Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language. Aligned paragraphs are sourced from Wikipedia pages in different languages.
arXiv Detail & Related papers (2023-09-16T04:34:55Z)
MegaWika: Millions of reports and their sources across 50 diverse languages [74.3909725023673]
MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
arXiv Detail & Related papers (2023-07-13T20:04:02Z)
Identifying Informational Sources in News Articles [109.70475599552523]
We build the largest and widest-ranging annotated dataset of informational sources used in news writing. We introduce a novel task, source prediction, to study the compositionality of sources in news articles.
arXiv Detail & Related papers (2023-05-24T08:56:35Z)
Evaluating BERT and ParsBERT for Analyzing Persian Advertisement Data [0.0]
The paper uses the example of Divar, an online marketplace for buying and selling products and services in Iran. It presents a competition to predict the percentage of a car sales ad that would be published on the Divar website. Since the dataset provides a rich source of Persian text data, the authors use the Hazm library, a Python library designed for processing Persian text, and two state-of-the-art language models, mBERT and ParsBERT, to analyze it.
arXiv Detail & Related papers (2023-05-03T20:50:05Z)
Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language. The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German. We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
A High-Quality Multilingual Dataset for Structured Documentation Translation [101.41835967142521]
This paper presents a high-quality multilingual dataset for the documentation domain. We collect XML-structured parallel text segments from the online documentation for an enterprise software platform.
arXiv Detail & Related papers (2020-06-24T02:08:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.