BERTweet: A pre-trained language model for English Tweets
- URL: http://arxiv.org/abs/2005.10200v2
- Date: Mon, 5 Oct 2020 10:00:24 GMT
- Title: BERTweet: A pre-trained language model for English Tweets
- Authors: Dat Quoc Nguyen, Thanh Vu and Anh Tuan Nguyen
- Abstract summary: We present BERTweet, the first public large-scale pre-trained language model for English Tweets.
BERTweet is trained using the RoBERTa pre-training procedure.
We release BERTweet under the MIT License to facilitate future research and applications on Tweet data.
- Score: 14.575661723724005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present BERTweet, the first public large-scale pre-trained language model
for English Tweets. Our BERTweet, having the same architecture as BERT-base
(Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu
et al., 2019). Experiments show that BERTweet outperforms strong baselines
RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better
performance results than the previous state-of-the-art models on three Tweet
NLP tasks: Part-of-speech tagging, Named-entity recognition and text
classification. We release BERTweet under the MIT License to facilitate future
research and applications on Tweet data. Our BERTweet is available at
https://github.com/VinAIResearch/BERTweet
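The abstract says BERTweet follows the RoBERTa pre-training procedure, which is built on masked language modeling with a mask pattern that is re-sampled each time a sequence is seen. The snippet below is a minimal, standard-library sketch of that idea and is not the authors' code. The 15% masking rate and the 80/10/10 replacement split are the usual BERT/RoBERTa defaults and are assumed here, not taken from this page. The function name `dynamic_mask`, the toy vocabulary, and the example tweet are hypothetical.

```python
import random

MASK = "<mask>"  # RoBERTa-style mask symbol (assumed for illustration)

def dynamic_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) with a fresh random mask on each call."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model is trained to recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # 80%: replace with the mask symbol
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)  # position does not contribute to the loss
    return masked, labels

# Hypothetical whitespace-tokenized tweet; a real setup would use subword units.
tweet = "just landed in hanoi and the weather is great".split()
vocab = sorted(set(tweet))

# A different seed per epoch stands in for re-sampling the mask on every pass.
for epoch in range(3):
    masked, _ = dynamic_mask(tweet, vocab, rng=random.Random(epoch))
    print(epoch, " ".join(masked))
```

Because the mask is drawn inside the function on every call, the same tweet is seen with different masked positions across epochs, in contrast to a mask that is fixed once during preprocessing.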
Related papers
- SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction [65.1590372072555]
We introduce SHuBERT, a self-supervised transformer encoder that learns strong representations from American Sign Language (ASL) video content.
Inspired by the success of the HuBERT speech representation model, SHuBERT adapts masked prediction for multi-stream visual sign language input.
SHuBERT achieves state-of-the-art performance across multiple benchmarks.
(2024-11-25T03:13:08Z)
- RoBERTweet: A BERT Language Model for Romanian Tweets [0.15293427903448023]
This article introduces RoBERTweet, the first Transformer architecture trained on Romanian tweets.
The corpus used for pre-training the models represents a novelty for the Romanian NLP community.
Experiments show that RoBERTweet models outperform the previous general-domain Romanian and multilingual language models on three NLP tasks with tweet inputs.
(2023-06-11T06:11:56Z)
- NarrowBERT: Accelerating Masked Language Model Pretraining and Inference [50.59811343945605]
We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$.
NarrowBERT sparsifies the transformer model such that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining.
We show that NarrowBERT increases the throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks like MNLI.
(2023-01-11T23:45:50Z)
- L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi [0.7874708385247353]
This work focuses on two low-resource Indian languages, Hindi and Marathi.
We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared using machine translation.
We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi.
(2022-11-21T05:15:48Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
(2021-10-14T04:05:25Z)
- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT [69.26447267827454]
Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training.
This paper introduces DistilHuBERT, a novel multi-task learning framework to distill hidden representations from a HuBERT model directly.
(2021-10-05T09:34:44Z)
- IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization [33.46519116869276]
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter.
We benchmark different ways of initializing the BERT embedding layer for new word types.
We find that initializing with the average BERT subword embedding makes pretraining five times faster.
(2021-09-10T01:27:51Z)
- Pre-Training BERT on Arabic Tweets: Practical Considerations [11.087099497830552]
We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing.
All are intended to support Arabic dialects and social media.
New models achieve new state-of-the-art results on several downstream tasks.
(2021-02-21T20:51:33Z)
- Syntax-Enhanced Pre-trained Model [49.1659635460369]
We study the problem of leveraging the syntactic structure of text to enhance pre-trained models such as BERT and RoBERTa.
Existing methods utilize syntax of text either in the pre-training stage or in the fine-tuning stage, so that they suffer from discrepancy between the two stages.
We present a model that utilizes the syntax of text in both pre-training and fine-tuning stages.
(2020-12-28T06:48:04Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
(2020-04-27T17:58:05Z)
- RobBERT: a Dutch RoBERTa-based Language Model [9.797319790710711]
We use RoBERTa to train a Dutch language model called RobBERT.
We measure its performance on various tasks as well as the importance of the fine-tuning dataset size.
RobBERT improves state-of-the-art results for various tasks, and especially significantly outperforms other models when dealing with smaller datasets.
(2020-01-17T13:25:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.