BERTuit: Understanding Spanish language in Twitter through a native
transformer
- URL: http://arxiv.org/abs/2204.03465v1
- Date: Thu, 7 Apr 2022 14:28:51 GMT
- Title: BERTuit: Understanding Spanish language in Twitter through a native
transformer
- Authors: Javier Huertas-Tato and Alejandro Martin and David Camacho
- Abstract summary: We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used in applications focused on this social network.
- Score: 70.77033762320572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The appearance of complex attention-based language models such as BERT,
RoBERTa or GPT-3 has made it possible to address highly complex tasks in a plethora of
scenarios. However, when applied to specific domains, these models encounter
considerable difficulties. This is the case of social networks such as Twitter,
an ever-changing stream of information written in informal and complex
language, where each message requires careful evaluation to be understood even
by humans, given the important role that context plays. Addressing tasks in this
domain through Natural Language Processing involves severe challenges. When
powerful state-of-the-art multilingual language models are applied to this
scenario, language-specific nuances tend to get lost in translation. To face
these challenges we present \textbf{BERTuit}, the largest transformer proposed
so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish
tweets using RoBERTa optimization. Our motivation is to provide a powerful
resource to better understand Spanish Twitter and to be used in applications
focused on this social network, with special emphasis on solutions devoted to
tackling the spread of misinformation on this platform. BERTuit is evaluated
on several tasks and compared against M-BERT, XLM-RoBERTa and XLM-T, all very
competitive multilingual transformers. The utility of our approach is shown
with two applications: a zero-shot methodology to visualize groups of
hoaxes and the profiling of authors who spread disinformation.
Misinformation spreads widely on platforms such as Twitter in languages other
than English, meaning that the performance of transformers may suffer when they are
transferred outside English-speaking communities.
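As a rough illustration of the zero-shot hoax-visualization idea described in the abstract, the sketch below embeds a few Spanish tweets with a RoBERTa-style encoder, then projects and clusters the resulting sentence vectors. The checkpoint identifier, pooling strategy and clustering step are assumptions for illustration only; they are not the official BERTuit release or the authors' exact pipeline.

```python
# Minimal sketch: embed Spanish tweets with a RoBERTa-style encoder and
# group them in a 2-D projection, loosely following the zero-shot
# hoax-visualization idea described in the abstract.
# NOTE: "PlanTL-GOB-ES/roberta-base-bne" is only a stand-in Spanish checkpoint;
# the BERTuit weights may be published under a different identifier.

import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

MODEL_ID = "PlanTL-GOB-ES/roberta-base-bne"  # assumption: any Spanish RoBERTa-style model

tweets = [
    "Este medicamento cura el virus en 24 horas, compártelo.",
    "El gobierno confirma nuevas medidas sanitarias para abril.",
    "Beber agua caliente elimina la infección, según expertos anónimos.",
    "La vacuna fue aprobada tras superar los ensayos clínicos de fase 3.",
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

with torch.no_grad():
    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state                  # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()       # ignore padding tokens
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

# Project to 2-D and cluster; with a real corpus these coordinates can be
# plotted to inspect groups of similar (potentially hoax-related) tweets.
coords = PCA(n_components=2).fit_transform(embeddings.numpy())
labels = KMeans(n_clusters=2, n_init=10).fit_predict(coords)
for tweet, label in zip(tweets, labels):
    print(label, tweet[:60])
```

With the full tweet corpus one would typically swap PCA for a non-linear projection and tune the number of clusters, but the shape of the pipeline (encode, pool, project, group) stays the same.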
Related papers
- Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction.
Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese.
We propose an approach to enhance SER performance in languages with scarce SER resources by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
- Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot [1.3741556944830366]
This study prompted GPT-3.5 to generate Afrikaans-English and Yoruba-English code-switched sentences.
The quality of generated sentences for languages using non-Latin scripts, like Yoruba, is considerably lower when compared with the high Afrikaans-English success rate.
We propose a framework for augmenting the diversity of synthetically generated code-switched data using GPT.
arXiv Detail & Related papers (2024-04-26T07:44:44Z)
- MuCoT: Multilingual Contrastive Training for Question-Answering in Low-resource Languages [4.433842217026879]
Multi-lingual BERT-based models (mBERT) are often used to transfer knowledge from high-resource languages to low-resource languages.
We augment the QA samples of the target language using translation and transliteration into other languages and use the augmented data to fine-tune an mBERT-based QA model.
Experiments on the Google ChAII dataset show that fine-tuning the mBERT model with translations from the same language family boosts the question-answering performance.
arXiv Detail & Related papers (2022-04-12T13:52:54Z)
- Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)
- Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter [9.359018642178917]
This paper presents a method to obtain multilingual datasets for stance detection in Twitter.
We leverage user-based information to semi-automatically label large amounts of tweets.
arXiv Detail & Related papers (2021-01-28T13:05:09Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively avoids degenerating into predicting masked words conditioned only on the context of the same language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- Improving Sentiment Analysis over non-English Tweets using Multilingual Transformers and Automatic Translation for Data-Augmentation [77.69102711230248]
We propose the use of a multilingual transformer model that we pre-train on English tweets and adapt to non-English languages through data augmentation based on automatic translation (a minimal sketch of this kind of translation-based augmentation follows the list below).
Our experiments in French, Spanish, German and Italian suggest that the proposed technique is an efficient way to improve the results of transformers over small corpora of tweets in a non-English language.
arXiv Detail & Related papers (2020-10-07T15:44:55Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We also propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
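Several of the entries above (e.g. MuCoT and the multilingual sentiment-analysis paper) rely on automatic translation to augment training data in another language. The sketch below shows one plausible way to generate such augmented examples with an off-the-shelf MarianMT model; the checkpoint and the toy data are assumptions for illustration, not the pipeline used in any of the cited papers.

```python
# Sketch of translation-based data augmentation: translate labeled English
# tweets into Spanish and reuse the labels, so a multilingual model can later
# be fine-tuned on the combined corpus.
# Assumption: the Helsinki-NLP MarianMT English->Spanish checkpoint; any
# machine-translation system could be substituted.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MT_MODEL = "Helsinki-NLP/opus-mt-en-es"
tokenizer = AutoTokenizer.from_pretrained(MT_MODEL)
translator = AutoModelForSeq2SeqLM.from_pretrained(MT_MODEL)

labeled_en = [
    ("I love this new phone, best purchase ever!", "positive"),
    ("The customer service was terrible and slow.", "negative"),
]

augmented = []
for text, label in labeled_en:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    generated = translator.generate(**inputs, max_new_tokens=64)
    translated = tokenizer.decode(generated[0], skip_special_tokens=True)
    augmented.append((translated, label))  # the original label is assumed to carry over

# `labeled_en + augmented` can then be used to fine-tune a multilingual
# classifier (e.g. an XLM-R or BERTuit-based model) on both languages.
print(augmented)
```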