Exposing propaganda: an analysis of stylistic cues comparing human
annotations and machine classification
- URL: http://arxiv.org/abs/2402.03780v3
- Date: Mon, 26 Feb 2024 14:07:20 GMT
- Title: Exposing propaganda: an analysis of stylistic cues comparing human
annotations and machine classification
- Authors: Géraud Faye, Benjamin Icard, Morgane Casanova, Julien Chanson,
François Maine, François Bancilhon, Guillaume Gadek, Guillaume Gravier,
Paul Égré
- Abstract summary: This paper investigates the language of propaganda and its stylistic features.
It presents the PPN dataset, composed of news articles extracted from websites identified as propaganda sources.
We propose different NLP techniques to identify the cues used by the annotators, and to compare them with machine classification.
- Score: 0.7749297275724032
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the language of propaganda and its stylistic
features. It presents the PPN dataset, standing for Propagandist Pseudo-News, a
multisource, multilingual, multimodal dataset composed of news articles
extracted from websites identified as propaganda sources by expert agencies. A
limited sample from this set was randomly mixed with articles from the regular
French press, with their URLs masked, to conduct a human annotation experiment
using 11 distinct labels. The results show that human annotators were able to
reliably discriminate between the two types of press on each of the labels. We
propose different NLP techniques to identify the cues used by the annotators
and to compare them with machine classification. These include the VAGO
analyzer, which measures discourse vagueness and subjectivity, a TF-IDF model
serving as a baseline, and four different classifiers: two RoBERTa-based
models, CATS, which uses syntactic features, and an XGBoost model combining
syntactic and semantic features.
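As a rough illustration of the kind of baseline the abstract mentions, the sketch below shows a generic TF-IDF plus linear classifier pipeline for separating propagandist from regular press articles. This is not the authors' implementation: the data loading, feature settings, and the choice of logistic regression are placeholder assumptions.

```python
# Minimal sketch, NOT the paper's implementation: a TF-IDF baseline of the
# kind the abstract mentions. Data loading, feature settings, and the choice
# of logistic regression are assumptions for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def tfidf_baseline(texts, labels):
    """Train and evaluate a TF-IDF baseline.

    texts  : list of article strings (propaganda and regular press mixed)
    labels : list of ints, e.g. 1 = propaganda source, 0 = regular press
    """
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels
    )

    pipeline = Pipeline([
        # Word unigrams and bigrams weighted by TF-IDF, vocabulary capped.
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2))),
        # A simple linear classifier on top of the sparse features.
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    pipeline.fit(X_train, y_train)
    print(classification_report(y_test, pipeline.predict(X_test)))
    return pipeline
```

In the paper, stronger classifiers (RoBERTa-based models, CATS, and an XGBoost model over syntactic and semantic features) are compared against such a TF-IDF baseline; the sketch only fixes the general shape of that comparison, not its details.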
Related papers
- How Language Models Prioritize Contextual Grammatical Cues? [3.9790222241649587]
We investigate how language models handle gender agreement when multiple gender cue words are present.
Our findings reveal striking differences in how encoder-based and decoder-based models prioritize and use contextual information for their predictions.
arXiv Detail & Related papers (2024-10-04T14:09:05Z) - MemeMind at ArAIEval Shared Task: Spotting Persuasive Spans in Arabic Text with Persuasion Techniques Identification [0.10120650818458249]
This paper focuses on detecting propagandistic spans and persuasion techniques in Arabic text from tweets and news paragraphs.
Our approach achieved an F1 score of 0.2774, securing the 3rd position on the leaderboard of Task 1.
arXiv Detail & Related papers (2024-08-08T15:49:01Z) - A Multi-Label Dataset of French Fake News: Human and Machine Insights [0.5533610982157059]
We present a corpus of 100 documents, OBSINFOX, selected from 17 sources of French press considered unreliable by expert agencies.
By collecting more labels than usual, we can identify features that humans consider as characteristic of fake news.
We present a topic and genre analysis using Gate Cloud, indicative of the prevalence of satire-like text in the corpus.
arXiv Detail & Related papers (2024-03-24T11:29:55Z) - HuBERTopic: Enhancing Semantic Representation of HuBERT through
Self-supervision Utilizing Topic Model [62.995175485416]
We propose a new approach to enrich the semantic representation of HuBERT.
An auxiliary topic classification task is added to HuBERT by using topic labels as teachers.
Experimental results demonstrate that our method achieves performance comparable to or better than the baseline in most tasks.
arXiv Detail & Related papers (2023-10-06T02:19:09Z) - mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view
Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER)
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z) - Description-Enhanced Label Embedding Contrastive Learning for Text
Classification [65.01077813330559]
We introduce Self-Supervised Learning (SSL) into the model learning process and design a novel self-supervised Relation of Relation (R2) classification task.
We propose a Relation of Relation Learning Network (R2-Net) for text classification, in which text classification and R2 classification are treated as optimization targets.
We use external knowledge from WordNet to obtain multi-aspect descriptions for label semantic learning.
arXiv Detail & Related papers (2023-06-15T02:19:34Z) - Generating More Pertinent Captions by Leveraging Semantics and Style on
Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z) - Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z) - LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for
Multi-Granular Propaganda Span Identification [70.1903083747775]
This paper describes our submission for the task of Propaganda Span Identification in news articles.
We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda.
arXiv Detail & Related papers (2020-08-11T16:14:47Z) - BPGC at SemEval-2020 Task 11: Propaganda Detection in News Articles with
Multi-Granularity Knowledge Sharing and Linguistic Features based Ensemble
Learning [2.8913142991383114]
SemEval 2020 Task-11 aims to design automated systems for news propaganda detection.
Task-11 consists of two sub-tasks, namely, Span Identification and Technique Classification.
arXiv Detail & Related papers (2020-05-31T19:35:53Z)