BERTuit: Understanding Spanish language in Twitter through a native
transformer
- URL: http://arxiv.org/abs/2204.03465v1
- Date: Thu, 7 Apr 2022 14:28:51 GMT
- Title: BERTuit: Understanding Spanish language in Twitter through a native
transformer
- Authors: Javier Huertas-Tato and Alejandro Martin and David Camacho
- Abstract summary: We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used in applications focused on this social network.
- Score: 70.77033762320572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The appearance of complex attention-based language models such as BERT,
RoBERTa or GPT-3 has made it possible to address highly complex tasks in a plethora of
scenarios. However, when applied to specific domains, these models encounter
considerable difficulties. This is the case of social networks such as Twitter,
an ever-changing stream of information written in informal and complex
language, where each message requires careful evaluation to be understood even
by humans, given the important role that context plays. Addressing tasks in this
domain through Natural Language Processing involves severe challenges. When
powerful state-of-the-art multilingual language models are applied to this
scenario, language-specific nuances tend to get lost in translation. To face
these challenges we present \textbf{BERTuit}, the largest transformer proposed
so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish
tweets using RoBERTa optimization. Our motivation is to provide a powerful
resource to better understand Spanish Twitter and to be used in applications
focused on this social network, with special emphasis on solutions devoted to
tackling the spread of misinformation on this platform. BERTuit is evaluated
on several tasks and compared against M-BERT, XLM-RoBERTa and XLM-T, all very
competitive multilingual transformers. The utility of our approach is shown
with two applications: a zero-shot methodology to visualize groups of
hoaxes and the profiling of authors who spread disinformation.
Misinformation spreads widely on platforms such as Twitter in languages other
than English, meaning that the performance of transformers may suffer when they are
transferred outside English-speaking communities.
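As a rough illustration of the zero-shot hoax-visualization idea described in the abstract, the sketch below embeds a few Spanish tweets with a RoBERTa-style encoder, then projects and clusters the resulting sentence vectors. The checkpoint identifier, pooling strategy and clustering step are assumptions for illustration only; they are not the official BERTuit release or the authors' exact pipeline.

```python
# Minimal sketch: embed Spanish tweets with a RoBERTa-style encoder and
# group them in a 2-D projection, loosely following the zero-shot
# hoax-visualization idea described in the abstract.
# NOTE: "PlanTL-GOB-ES/roberta-base-bne" is only a stand-in Spanish checkpoint;
# the BERTuit weights may be published under a different identifier.

import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

MODEL_ID = "PlanTL-GOB-ES/roberta-base-bne"  # assumption: any Spanish RoBERTa-style model

tweets = [
    "Este medicamento cura el virus en 24 horas, compártelo.",
    "El gobierno confirma nuevas medidas sanitarias para abril.",
    "Beber agua caliente elimina la infección, según expertos anónimos.",
    "La vacuna fue aprobada tras superar los ensayos clínicos de fase 3.",
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

with torch.no_grad():
    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state                  # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()       # ignore padding tokens
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

# Project to 2-D and cluster; with a real corpus these coordinates can be
# plotted to inspect groups of similar (potentially hoax-related) tweets.
coords = PCA(n_components=2).fit_transform(embeddings.numpy())
labels = KMeans(n_clusters=2, n_init=10).fit_predict(coords)
for tweet, label in zip(tweets, labels):
    print(label, tweet[:60])
```

With the full tweet corpus one would typically swap PCA for a non-linear projection and tune the number of clusters, but the shape of the pipeline (encode, pool, project, group) stays the same.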
Related papers
- Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction.
Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese.
We propose an approach to enhance SER performance in languages with scarce SER resources by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
- Prompting Towards Alleviating Code-Switched Data Scarcity in Under-Resourced Languages with GPT as a Pivot [1.3741556944830366]
This study prompted GPT-3.5 to generate Afrikaans-English and Yoruba-English code-switched sentences.
The quality of generated sentences for languages using non-Latin scripts, like Yoruba, is considerably lower when compared with the high Afrikaans-English success rate.
We propose a framework for augmenting the diversity of synthetically generated code-switched data using GPT.
arXiv Detail & Related papers (2024-04-26T07:44:44Z)
- MuCoT: Multilingual Contrastive Training for Question-Answering in Low-resource Languages [4.433842217026879]
Multi-lingual BERT-based models (mBERT) are often used to transfer knowledge from high-resource languages to low-resource languages.
We augment the QA samples of the target language using translation and transliteration into other languages and use the augmented data to fine-tune an mBERT-based QA model.
Experiments on the Google ChAII dataset show that fine-tuning the mBERT model with translations from the same language family boosts the question-answering performance.
arXiv Detail & Related papers (2022-04-12T13:52:54Z)
- Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation [75.82110684355979]
We introduce a two-stream model that translates images in input space using an object-aware transformer.
We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model.
We achieve state-of-the-art performance on two multimodal Twitter datasets.
arXiv Detail & Related papers (2021-08-03T18:02:38Z)
- Semi-automatic Generation of Multilingual Datasets for Stance Detection in Twitter [9.359018642178917]
This paper presents a method to obtain multilingual datasets for stance detection in Twitter.
We leverage user-based information to semi-automatically label large amounts of tweets.
arXiv Detail & Related papers (2021-01-28T13:05:09Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively avoids degenerating into predicting masked words conditioned only on the context of the same language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- Improving Sentiment Analysis over non-English Tweets using Multilingual Transformers and Automatic Translation for Data-Augmentation [77.69102711230248]
We propose the use of a multilingual transformer model that we pre-train on English tweets and adapt to non-English languages through data augmentation based on automatic translation (a minimal sketch of this kind of translation-based augmentation follows the list below).
Our experiments in French, Spanish, German and Italian suggest that the proposed technique is an efficient way to improve the results of transformers over small corpora of tweets in a non-English language.
arXiv Detail & Related papers (2020-10-07T15:44:55Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We also propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
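Several of the entries above (e.g. MuCoT and the multilingual sentiment-analysis paper) rely on automatic translation to augment training data in another language. The sketch below shows one plausible way to generate such augmented examples with an off-the-shelf MarianMT model; the checkpoint and the toy data are assumptions for illustration, not the pipeline used in any of the cited papers.

```python
# Sketch of translation-based data augmentation: translate labeled English
# tweets into Spanish and reuse the labels, so a multilingual model can later
# be fine-tuned on the combined corpus.
# Assumption: the Helsinki-NLP MarianMT English->Spanish checkpoint; any
# machine-translation system could be substituted.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MT_MODEL = "Helsinki-NLP/opus-mt-en-es"
tokenizer = AutoTokenizer.from_pretrained(MT_MODEL)
translator = AutoModelForSeq2SeqLM.from_pretrained(MT_MODEL)

labeled_en = [
    ("I love this new phone, best purchase ever!", "positive"),
    ("The customer service was terrible and slow.", "negative"),
]

augmented = []
for text, label in labeled_en:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    generated = translator.generate(**inputs, max_new_tokens=64)
    translated = tokenizer.decode(generated[0], skip_special_tokens=True)
    augmented.append((translated, label))  # the original label is assumed to carry over

# `labeled_en + augmented` can then be used to fine-tune a multilingual
# classifier (e.g. an XLM-R or BERTuit-based model) on both languages.
print(augmented)
```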