IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with
Effective Domain-Specific Vocabulary Initialization
- URL: http://arxiv.org/abs/2109.04607v1
- Date: Fri, 10 Sep 2021 01:27:51 GMT
- Authors: Fajri Koto and Jey Han Lau and Timothy Baldwin
- Abstract summary: IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter.
We benchmark different ways of initializing the BERT embedding layer for new word types.
We find that initializing with the average BERT subword embedding makes pretraining five times faster.
- Score: 33.46519116869276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present IndoBERTweet, the first large-scale pretrained model for
Indonesian Twitter that is trained by extending a monolingually-trained
Indonesian BERT model with additive domain-specific vocabulary. We focus in
particular on efficient model adaptation under vocabulary mismatch, and
benchmark different ways of initializing the BERT embedding layer for new word
types. We find that initializing with the average BERT subword embedding makes
pretraining five times faster, and is more effective than previously proposed methods for
vocabulary adaptation in terms of extrinsic evaluation over seven Twitter-based
datasets.
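The core adaptation step can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration (not the authors' released code) of the average-subword-embedding initialization described in the abstract: each new Twitter-specific word type added to the vocabulary is initialized with the mean of the embeddings of its subword pieces under the original tokenizer. The Hugging Face model ID and the example tokens are assumptions for illustration only.

```python
# Minimal sketch of average-subword-embedding initialization for new vocabulary.
# Assumptions: "indolem/indobert-base-uncased" stands in for the base Indonesian
# BERT, and the example Twitter tokens are placeholders, not the paper's vocabulary.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

base_id = "indolem/indobert-base-uncased"  # assumed base model ID
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForMaskedLM.from_pretrained(base_id)

new_tokens = ["wkwk", "gaes", "mager"]  # hypothetical domain-specific word types

# Record each new token's subword decomposition under the *original* vocabulary,
# before the tokenizer is extended.
subword_ids = {t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # new rows start randomly initialized

embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # Overwrite the random row with the mean of the original subword embeddings,
        # the initialization the paper reports as roughly five times faster to pretrain.
        embeddings[new_id] = embeddings[subword_ids[tok]].mean(dim=0)
```

Continued pretraining on in-domain Twitter text would then proceed as usual; the sketch only covers the embedding-layer initialization that the paper benchmarks.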
Related papers
- OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining [49.213120730582354]
Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining.
We propose a novel framework, One For All (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively.
arXiv Detail & Related papers (2023-11-15T10:40:45Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-05-21T06:44:59Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- A Comprehensive Comparison of Pre-training Language Models [0.5139874302398955]
We pre-train a list of transformer-based models with the same amount of text and the same training steps.
The experimental results show that the largest improvement over the original BERT comes from adding an RNN layer to capture more contextual information for short-text understanding.
arXiv Detail & Related papers (2021-06-22T02:12:29Z)
- WangchanBERTa: Pretraining transformer-based Thai Language Models [2.186960190193067]
We pretrain a language model based on the RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total).
We apply text-processing rules that are specific to Thai, most importantly preserving spaces.
We also experiment with word-level, syllable-level and SentencePiece tokenization on a smaller dataset to explore the effects of tokenization on downstream performance.
arXiv Detail & Related papers (2021-01-24T03:06:34Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method [29.352569563032056]
We propose a novel method to explicitly inject linguistic knowledge in the form of word embeddings into a pre-trained BERT.
Our performance improvements on multiple semantic similarity datasets when injecting dependency-based and counter-fitted embeddings indicate that such information is beneficial and currently missing from the original model.
arXiv Detail & Related papers (2020-10-23T17:00:26Z)
- RobBERT: a Dutch RoBERTa-based Language Model [9.797319790710711]
We use RoBERTa to train a Dutch language model called RobBERT.
We measure its performance on various tasks as well as the importance of the fine-tuning dataset size.
RobBERT improves state-of-the-art results for various tasks, and especially significantly outperforms other models when dealing with smaller datasets.
arXiv Detail & Related papers (2020-01-17T13:25:44Z)