Semi-automatic Generation of Multilingual Datasets for Stance Detection
in Twitter
- URL: http://arxiv.org/abs/2101.11978v1
- Date: Thu, 28 Jan 2021 13:05:09 GMT
- Title: Semi-automatic Generation of Multilingual Datasets for Stance Detection
in Twitter
- Authors: Elena Zotova, Rodrigo Agerri, German Rigau
- Abstract summary: This paper presents a method to obtain multilingual datasets for stance detection in Twitter.
We leverage user-based information to semi-automatically label large amounts of tweets.
- Score: 9.359018642178917
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Popular social media networks provide the perfect environment to study the
opinions and attitudes expressed by users. While interactions in social media
such as Twitter occur in many natural languages, research on stance detection
(the position or attitude expressed with respect to a specific topic) within
the Natural Language Processing field has largely been done for English.
Although some efforts have recently been made to develop annotated data in
other languages, there is a telling lack of resources to facilitate
multilingual and crosslingual research on stance detection. This is partially
due to the fact that manually annotating a corpus of social media texts is a
difficult, slow and costly process. Furthermore, as stance is a highly domain-
and topic-specific phenomenon, the need for annotated data is specially
demanding. As a result, most of the manually labeled resources are hindered by
their relatively small size and skewed class distribution. This paper presents
a method to obtain multilingual datasets for stance detection in Twitter.
Instead of manually annotating on a per tweet basis, we leverage user-based
information to semi-automatically label large amounts of tweets. Empirical
monolingual and cross-lingual experimentation and qualitative analysis show
that our method helps to overcome the aforementioned difficulties to build
large, balanced and multilingual labeled corpora. We believe that our method
can be easily adapted to easily generate labeled social media data for other
Natural Language Processing tasks and domains.
Related papers
- M2SA: Multimodal and Multilingual Model for Sentiment Analysis of Tweets [4.478789600295492]
This paper transforms an existing textual Twitter sentiment dataset into a multimodal format through a straightforward curation process.
Our work opens up new avenues for sentiment-related research within the research community.
arXiv Detail & Related papers (2024-04-02T09:11:58Z) - Automated stance detection in complex topics and small languages: the
challenging case of immigration in polarizing news media [0.0]
This paper explores the applicability of large language models for automated stance detection in a challenging scenario.
It involves a morphologically complex, lower-resource language, and a socio-culturally complex topic, immigration.
If the approach works in this case, it can be expected to perform as well or better in less demanding scenarios.
arXiv Detail & Related papers (2023-05-22T13:56:35Z) - Progressive Sentiment Analysis for Code-Switched Text Data [26.71396390928905]
We focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data.
We propose a framework that takes the distinction between resource-rich and low-resource language into account.
arXiv Detail & Related papers (2022-10-25T23:13:53Z) - Relational Embeddings for Language Independent Stance Detection [4.492444446637856]
We propose a new method to leverage social information such as friends and retweets by generating relational embeddings.
Our method can be applied to any language and target without any manual tuning.
arXiv Detail & Related papers (2022-10-11T18:13:43Z) - BERTuit: Understanding Spanish language in Twitter through a native
transformer [70.77033762320572]
We present bfBERTuit, the larger transformer proposed so far for Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z) - Few-Shot Cross-Lingual Stance Detection with Sentiment-Based
Pre-Training [32.800766653254634]
We present the most comprehensive study of cross-lingual stance detection to date.
We use 15 diverse datasets in 12 languages from 6 language families.
For our experiments, we build on pattern-exploiting training, proposing the addition of a novel label encoder.
arXiv Detail & Related papers (2021-09-13T15:20:06Z) - Exploiting BERT For Multimodal Target SentimentClassification Through
Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal, and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks.
This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.