Deriving Disinformation Insights from Geolocalized Twitter Callouts
- URL: http://arxiv.org/abs/2108.03067v1
- Date: Fri, 6 Aug 2021 11:39:05 GMT
- Title: Deriving Disinformation Insights from Geolocalized Twitter Callouts
- Authors: David Tuxworth, Dimosthenis Antypas, Luis Espinosa-Anke, Jose
Camacho-Collados, Alun Preece, David Rogers
- Abstract summary: This paper demonstrates a two-stage method for deriving insights from social media data relating to disinformation by applying a combination of geospatial classification and embedding-based language modelling.
Twitter data is classified into European and non-European sets using BERT.
Word2vec is applied to the classified texts resulting in Eurocentric, non-Eurocentric and global representations of the data for the three target languages.
- Score: 7.951685935253415
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper demonstrates a two-stage method for deriving insights from social
media data relating to disinformation by applying a combination of geospatial
classification and embedding-based language modelling across multiple
languages. In particular, the analysis is centered on Twitter and
disinformation for three European languages: English, French and Spanish.
Firstly, Twitter data is classified into European and non-European sets using
BERT. Secondly, Word2vec is applied to the classified texts resulting in
Eurocentric, non-Eurocentric and global representations of the data for the
three target languages. This comparative analysis demonstrates not only the
efficacy of the classification method but also highlights geographic, temporal
and linguistic differences in the disinformation-related media. Thus, the
contributions of the work are threefold: (i) a novel language-independent
transformer-based geolocation method; (ii) an analytical approach that exploits
lexical specificity and word embeddings to interrogate user-generated content;
and (iii) a dataset of 36 million disinformation-related tweets in English,
French and Spanish.
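The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. The stage-1 classifier here is a toy keyword heuristic standing in for the paper's fine-tuned BERT geolocation model, and stage 2 uses raw co-occurrence counts as a stand-in for Word2vec; the tweets, marker words, and function names are all illustrative assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict

# Stage 1: geolocation classification.
# Placeholder for the paper's BERT-based classifier; a keyword
# heuristic stands in here purely to illustrate the data flow.
def classify_region(tweet: str) -> str:
    european_markers = {"brexit", "eu", "macron"}  # hypothetical markers
    tokens = set(tweet.lower().split())
    return "european" if tokens & european_markers else "non_european"

# Stage 2: per-subset representations.
# Stand-in for Word2vec: window-based co-occurrence counts, enough to
# show that each subset yields its own representation of the vocabulary.
def cooccurrence_vectors(tweets, window=2):
    vectors = defaultdict(Counter)
    for tweet in tweets:
        tokens = tweet.lower().split()
        for i, tok in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    vectors[tok][tokens[j]] += 1
    return vectors

tweets = [  # toy corpus
    "brexit disinformation spreads fast",
    "eu policy rumours trend again",
    "election rumours spread in brazil",
]

# Split the corpus, then build Eurocentric, non-Eurocentric and
# global representations, mirroring the paper's comparative setup.
subsets = defaultdict(list)
for t in tweets:
    subsets[classify_region(t)].append(t)

eurocentric = cooccurrence_vectors(subsets["european"])
non_eurocentric = cooccurrence_vectors(subsets["non_european"])
global_vectors = cooccurrence_vectors(tweets)
```

In the actual study, comparing a term's neighbourhood across the Eurocentric, non-Eurocentric and global spaces is what surfaces the geographic and linguistic differences the abstract reports.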
Related papers
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z) - OPSD: an Offensive Persian Social media Dataset and its baseline evaluations [2.356562319390226]
This paper introduces two offensive datasets for Persian language.
The first dataset comprises annotations provided by domain experts, while the second consists of a large collection of unlabeled data obtained through web crawling.
The obtained F1-scores for the three-class and two-class versions of the dataset were 76.9% and 89.9% for XLM-RoBERTa, respectively.
arXiv Detail & Related papers (2024-04-08T14:08:56Z) - Cross-lingual Offensive Language Detection: A Systematic Review of
Datasets, Transfer Approaches and Challenges [10.079109184645478]
This survey presents a systematic and comprehensive exploration of Cross-Lingual Transfer Learning techniques in offensive language detection in social media.
Our study stands as the first holistic overview to focus exclusively on the cross-lingual scenario in this domain.
arXiv Detail & Related papers (2024-01-17T14:44:27Z) - When a Language Question Is at Stake. A Revisited Approach to Label
Sensitive Content [0.0]
This article revisits an approach to pseudo-labelling sensitive data, using the example of Ukrainian tweets covering the Russian-Ukrainian war.
We provide a fundamental statistical analysis of the obtained data, evaluate the models used for pseudo-labelling, and set out guidelines on how researchers can leverage the corpus.
arXiv Detail & Related papers (2023-11-17T13:35:10Z) - Multi-EuP: The Multilingual European Parliament Dataset for Analysis of
Bias in Information Retrieval [62.82448161570428]
This dataset is designed to investigate fairness in a multilingual information retrieval context.
It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages.
It offers rich demographic information associated with its documents, facilitating the study of demographic bias.
arXiv Detail & Related papers (2023-11-03T12:29:11Z) - X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs [55.80189506270598]
X-PARADE is the first cross-lingual dataset of paragraph-level information divergences.
Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language.
Aligned paragraphs are sourced from Wikipedia pages in different languages.
arXiv Detail & Related papers (2023-09-16T04:34:55Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - Semi-automatic Generation of Multilingual Datasets for Stance Detection
in Twitter [9.359018642178917]
This paper presents a method to obtain multilingual datasets for stance detection in Twitter.
We leverage user-based information to semi-automatically label large amounts of tweets.
arXiv Detail & Related papers (2021-01-28T13:05:09Z) - Multilingual Irony Detection with Dependency Syntax and Neural Models [61.32653485523036]
It focuses on the contribution from syntactic knowledge, exploiting linguistic resources where syntax is annotated according to the Universal Dependencies scheme.
The results suggest that fine-grained dependency-based syntactic information is informative for the detection of irony.
arXiv Detail & Related papers (2020-11-11T11:22:05Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.