RoBERTweet: A BERT Language Model for Romanian Tweets
- URL: http://arxiv.org/abs/2306.06598v1
- Date: Sun, 11 Jun 2023 06:11:56 GMT
- Title: RoBERTweet: A BERT Language Model for Romanian Tweets
- Authors: Iulian-Marius Tăiatu, Andrei-Marius Avram, Dumitru-Clementin Cercel and Florin Pop
- Abstract summary: This article introduces RoBERTweet, the first Transformer architecture trained on Romanian tweets.
The corpus used for pre-training the models represents a novelty for the Romanian NLP community.
Experiments show that RoBERTweet models outperform the previous general-domain Romanian and multilingual language models on three NLP tasks with tweet inputs.
- Score: 0.15293427903448023
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Developing natural language processing (NLP) systems for social media
analysis remains an important topic in artificial intelligence research. This
article introduces RoBERTweet, the first Transformer architecture trained on
Romanian tweets. Our RoBERTweet comes in two versions, following the base and
large architectures of BERT. The corpus used for pre-training the models
represents a novelty for the Romanian NLP community and consists of all tweets
collected from 2008 to 2022. Experiments show that RoBERTweet models outperform
the previous general-domain Romanian and multilingual language models on three
NLP tasks with tweet inputs: emotion detection, sexist language identification,
and named entity recognition. We make our models and the newly created corpus
of Romanian tweets freely available.
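Since the abstract describes fine-tuning RoBERTweet on tweet-classification tasks such as emotion detection, the sketch below shows what such a fine-tuning run could look like with the Hugging Face Transformers library. This is a minimal sketch under stated assumptions: the checkpoint path, the number of labels, and the training examples are placeholders, not the authors' released artifacts.

```python
# Minimal fine-tuning sketch, assuming a RoBERTweet checkpoint compatible with
# Hugging Face Transformers. "robertweet-base" is a placeholder path, not the
# authors' official model identifier; labels and examples are toy values.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_PATH = "robertweet-base"  # placeholder: point this at the released checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=4)

# Toy emotion-detection examples; the paper's actual datasets are not reproduced here.
train_data = Dataset.from_dict({
    "text": ["Ce zi frumoasă!", "Sunt foarte supărat pe trafic."],
    "label": [0, 1],
})

def tokenize(batch):
    # Tweets are short, so a 128-token cap is generous.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="robertweet-emotion", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_data,
)
trainer.train()
```

The same recipe applies to the base and large variants; only the checkpoint path and classification head size change per task.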
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
Until now, there has been no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use [9.797319790710711]
We update RobBERT, a state-of-the-art Dutch language model, which was trained in 2019.
First, the tokenizer of RobBERT is updated to include new high-frequency tokens present in the latest Dutch OSCAR corpus.
To evaluate if our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens.
arXiv Detail & Related papers (2022-11-15T14:55:53Z)
- TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter [31.698196219228024]
We present TwHIN-BERT, a multilingual language model productionized at Twitter.
Our model is trained on 7 billion tweets covering over 100 distinct languages.
We evaluate our model on various multilingual social recommendation and semantic understanding tasks.
arXiv Detail & Related papers (2022-09-15T19:01:21Z)
- BERTuit: Understanding Spanish language in Twitter through a native transformer [70.77033762320572]
We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z)
- From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French [57.886210204774834]
We present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries).
We present the FreEM_max corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEM_max.
arXiv Detail & Related papers (2022-02-18T22:17:22Z)
- RoBERTuito: a pre-trained language model for social media text in Spanish [1.376408511310322]
RoBERTuito is a pre-trained language model for user-generated content in Spanish.
We trained RoBERTuito on 500 million tweets in Spanish.
arXiv Detail & Related papers (2021-11-18T00:10:25Z)
- FBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available, with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study provides an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis [0.0]
We introduce two TweetBERT models, which are domain-specific language representation models pre-trained on millions of tweets.
We show that the TweetBERT models significantly outperform the traditional BERT models in Twitter text mining tasks by more than 7% on each Twitter dataset.
arXiv Detail & Related papers (2020-10-17T00:45:02Z)
- InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
- BERTweet: A pre-trained language model for English Tweets [14.575661723724005]
We present BERTweet, the first public large-scale pre-trained language model for English Tweets.
BERTweet is trained using the RoBERTa pre-training procedure.
We release BERTweet under the MIT License to facilitate future research and applications on Tweet data.
arXiv Detail & Related papers (2020-05-20T17:05:57Z)
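For the BERTweet entry above, a brief usage sketch: it assumes the vinai/bertweet-base checkpoint the BERTweet authors published on the Hugging Face Hub, and the input tweet is an arbitrary example rather than data from the paper.

```python
# Usage sketch for BERTweet via the vinai/bertweet-base checkpoint on the
# Hugging Face Hub; the input tweet is an arbitrary example.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
model = AutoModel.from_pretrained("vinai/bertweet-base")

inputs = tokenizer("SC has first two presumptive cases of coronavirus", return_tensors="pt")
with torch.no_grad():
    # last_hidden_state holds one contextual embedding per (sub)token.
    features = model(**inputs).last_hidden_state
print(features.shape)
```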