KR-BERT: A Small-Scale Korean-Specific Language Model
- URL: http://arxiv.org/abs/2008.03979v2
- Date: Tue, 11 Aug 2020 06:21:50 GMT
- Title: KR-BERT: A Small-Scale Korean-Specific Language Model
- Authors: Sangah Lee, Hansol Jang, Yunmee Baik, Suzi Park, Hyopil Shin
- Abstract summary: We trained a Korean-specific model KR-BERT, utilizing a smaller vocabulary and dataset.
Our model performed comparably to, and in some cases better than, other existing pre-trained models while using a corpus about 1/10 of the size.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the appearance of BERT, recent works including XLNet and RoBERTa
utilize sentence embedding models pre-trained by large corpora and a large
number of parameters. Because such models require large hardware resources and
huge amounts of data, they take a long time to pre-train. It is therefore
important to build smaller models that perform comparably. In this paper, we
trained a Korean-specific model, KR-BERT, utilizing a smaller vocabulary and
dataset. Since Korean is a morphologically rich, low-resource language written
in a non-Latin alphabet, it is also important to capture
language-specific linguistic phenomena that the Multilingual BERT model missed.
We tested several tokenizers including our BidirectionalWordPiece Tokenizer and
adjusted the minimal span of tokens for tokenization ranging from sub-character
level to character-level to construct a better vocabulary for our model. With
those adjustments, our KR-BERT model performed comparably to, and in some cases
better than, other existing pre-trained models while using a corpus about 1/10 of the size.
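The "sub-character level" mentioned above refers to decomposing Hangul syllables into their constituent jamo. As a minimal illustration (a sketch, not the KR-BERT implementation; function and variable names are illustrative), the snippet below splits precomposed Hangul syllables into initial, medial, and final jamo using standard Unicode arithmetic:

```python
# Minimal sketch of sub-character (jamo) decomposition for Korean.
# Assumption: this is illustrative code, not the authors' tokenizer.

CHOSEONG = [chr(0x1100 + i) for i in range(19)]           # 19 initial consonants
JUNGSEONG = [chr(0x1161 + i) for i in range(21)]          # 21 medial vowels
JONGSEONG = [""] + [chr(0x11A8 + i) for i in range(27)]   # 27 final consonants + "no final"

def to_subcharacters(text: str) -> str:
    """Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into jamo;
    leave all other characters unchanged."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:                      # precomposed Hangul syllable block
            offset = code - 0xAC00
            cho, rest = divmod(offset, 21 * 28)
            jung, jong = divmod(rest, 28)
            pieces.append(CHOSEONG[cho] + JUNGSEONG[jung] + JONGSEONG[jong])
        else:
            pieces.append(ch)
    return "".join(pieces)

# Example: each syllable of "한국어" becomes two or three jamo, over which a
# WordPiece-style vocabulary can then be learned instead of over whole syllables.
print(to_subcharacters("한국어"))
```

A character-level vocabulary would instead keep each syllable intact; the BidirectionalWordPiece Tokenizer described in the abstract then operates over whichever granularity is chosen, combining forward and backward longest-match segmentation (see the paper for the exact selection rule).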
Related papers
- From N-grams to Pre-trained Multilingual Models For Language Identification [0.35760345713831954]
We investigate the use of N-gram models and Large Pre-trained Multilingual models for Language Identification (LID) across 11 South African languages.
For N-gram models, this study shows that effective data size selection remains crucial for establishing effective frequency distributions of the target languages.
We show that Serengeti is the superior model on average across model types, from N-grams to Transformers.
arXiv Detail & Related papers (2024-10-11T11:35:57Z)
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-05-21T06:44:59Z)
- Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages [0.0]
We train a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian.
We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy.
arXiv Detail & Related papers (2021-12-20T14:26:40Z)
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- WikiBERT models: deep transfer learning for many languages [1.3455090151301572]
We introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data.
We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
arXiv Detail & Related papers (2020-06-02T11:57:53Z)
- Structure-Level Knowledge Distillation For Multilingual Sequence Labeling [73.40368222437912]
We propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models into the unified multilingual model (student).
Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and teacher models.
arXiv Detail & Related papers (2020-04-08T07:14:01Z)