RoBERTuito: a pre-trained language model for social media text in
Spanish
- URL: http://arxiv.org/abs/2111.09453v1
- Date: Thu, 18 Nov 2021 00:10:25 GMT
- Title: RoBERTuito: a pre-trained language model for social media text in
Spanish
- Authors: Juan Manuel Pérez, Damián A. Furman, Laura Alonso Alemany, Franco Luque
- Abstract summary: RoBERTuito is a pre-trained language model for user-generated content in Spanish.
We trained RoBERTuito on 500 million tweets in Spanish.
- Score: 1.376408511310322
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since BERT appeared, Transformer language models and transfer learning have
become state-of-the-art for Natural Language Understanding tasks. Recently,
some works have been geared towards pre-training specially-crafted models for
particular domains, such as scientific papers, medical documents, and others. In this
work, we present RoBERTuito, a pre-trained language model for user-generated
content in Spanish. We trained RoBERTuito on 500 million tweets in Spanish.
Experiments on a benchmark of 4 tasks involving user-generated text showed that
RoBERTuito outperformed other pre-trained language models for Spanish. In order
to help further research, we make RoBERTuito publicly available at the
HuggingFace model hub.
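Since RoBERTuito is distributed through the HuggingFace model hub, it can be loaded with the transformers library. The sketch below is illustrative only: the repository identifier and the tweet-normalization step (user handles and URLs replaced by placeholder tokens) are assumptions based on the paper's description, not details stated in this abstract.

```python
# Minimal sketch: loading RoBERTuito from the HuggingFace hub and encoding a tweet.
# The model id is an assumption; check the hub for the exact repository name.
from transformers import AutoTokenizer, AutoModel
import torch

MODEL_ID = "pysentimiento/robertuito-base-uncased"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# RoBERTuito was trained on preprocessed tweets; a rough approximation of that
# normalization (user-handle and URL placeholders) is written by hand here.
tweet = "@usuario esto es genial! mira url"

inputs = tokenizer(tweet, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pooled contextual embeddings, usable as a sentence vector for downstream tasks.
sentence_vector = outputs.last_hidden_state.mean(dim=1)
print(sentence_vector.shape)
```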
Related papers
- Spanish Pre-trained BERT Model and Evaluation Data [0.0]
We present a BERT-based language model pre-trained exclusively on Spanish data.
We also compiled several tasks specifically for the Spanish language in a single repository.
We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.
arXiv Detail & Related papers (2023-08-06T00:16:04Z)
- RoBERTweet: A BERT Language Model for Romanian Tweets [0.15293427903448023]
This article introduces RoBERTweet, the first Transformer architecture trained on Romanian tweets.
The corpus used for pre-training the models represents a novelty for the Romanian NLP community.
Experiments show that RoBERTweet models outperform the previous general-domain Romanian and multilingual language models on three NLP tasks with tweet inputs.
arXiv Detail & Related papers (2023-06-11T06:11:56Z)
- RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use [9.797319790710711]
We update RobBERT, a state-of-the-art Dutch language model, which was trained in 2019.
First, the tokenizer of RobBERT is updated to include new high-frequency tokens present in the latest Dutch OSCAR corpus.
To evaluate if our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens.
arXiv Detail & Related papers (2022-11-15T14:55:53Z)
- TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter [31.698196219228024]
We present TwHIN-BERT, a multilingual language model productionized at Twitter.
Our model is trained on 7 billion tweets covering over 100 distinct languages.
We evaluate our model on various multilingual social recommendation and semantic understanding tasks.
arXiv Detail & Related papers (2022-09-15T19:01:21Z)
- RuBioRoBERTa: a pre-trained biomedical language model for Russian language biomedical text mining [117.56261821197741]
We present several BERT-based models for Russian language biomedical text mining.
The models are pre-trained on a corpus of freely available texts in the Russian biomedical domain.
arXiv Detail & Related papers (2022-04-08T09:18:59Z)
- BERTuit: Understanding Spanish language in Twitter through a native transformer [70.77033762320572]
We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used in applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z)
- From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French [57.886210204774834]
We present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries).
We present the FreEM_max corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEM_max.
arXiv Detail & Related papers (2022-02-18T22:17:22Z)
- Scribosermo: Fast Speech-to-Text models for German and other Languages [69.7571480246023]
This paper presents Speech-to-Text models for German, as well as for Spanish and French with special features.
They are small and run in real-time on single-board computers like a Raspberry Pi.
Using a pretrained English model, they can be trained on consumer-grade hardware with a relatively small dataset.
arXiv Detail & Related papers (2021-10-15T10:10:34Z)
- GottBERT: a pure German Language Model [0.0]
No single-language German RoBERTa model has been published yet; we introduce one in this work (GottBERT).
In an evaluation, we compare its performance on the two Named Entity Recognition (NER) tasks, CoNLL 2003 and GermEval 2014, as well as on the text classification tasks GermEval 2018 (fine and coarse) and GNAD, against existing German single-language BERT models and two multilingual ones.
GottBERT was successfully pre-trained on a 256 core TPU pod using the RoBERTa BASE architecture.
arXiv Detail & Related papers (2020-12-03T17:45:03Z)
- Pretrained Language Model Embryology: The Birth of ALBERT [68.5801642674541]
We investigate the developmental process from a set of randomly initialized parameters to a totipotent language model.
Our results show that ALBERT learns to reconstruct and predict tokens of different parts of speech (POS) at different speeds during pretraining.
These findings suggest that the knowledge of a pretrained model varies during pretraining, and that more pretraining steps do not necessarily give a model more comprehensive knowledge.
arXiv Detail & Related papers (2020-10-06T05:15:39Z)
- Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.