KinyaBERT: a Morphology-aware Kinyarwanda Language Model
- URL: http://arxiv.org/abs/2203.08459v2
- Date: Thu, 17 Mar 2022 12:35:21 GMT
- Title: KinyaBERT: a Morphology-aware Kinyarwanda Language Model
- Authors: Antoine Nzeyimana, Andre Niyongabo Rubungo
- Abstract summary: Unsupervised sub-word tokenization methods are sub-optimal at handling morphologically rich languages.
We propose a simple yet effective two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality.
We evaluate our proposed method on the low-resource morphologically rich Kinyarwanda language, naming the proposed model architecture KinyaBERT.
- Score: 1.2183405753834562
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pre-trained language models such as BERT have been successful at tackling
many natural language processing tasks. However, the unsupervised sub-word
tokenization methods commonly used in these models (e.g., byte-pair encoding -
BPE) are sub-optimal at handling morphologically rich languages. Even given a
morphological analyzer, naive sequencing of morphemes into a standard BERT
architecture is inefficient at capturing morphological compositionality and
expressing word-relative syntactic regularities. We address these challenges by
proposing a simple yet effective two-tier BERT architecture that leverages a
morphological analyzer and explicitly represents morphological
compositionality. Despite the success of BERT, most of its evaluations have
been conducted on high-resource languages, obscuring its applicability on
low-resource languages. We evaluate our proposed method on the low-resource
morphologically rich Kinyarwanda language, naming the proposed model
architecture KinyaBERT. A robust set of experimental results reveal that
KinyaBERT outperforms solid baselines by 2% in F1 score on a named entity
recognition task and by 4.3% in average score of a machine-translated GLUE
benchmark. KinyaBERT fine-tuning has better convergence and achieves more
robust results on multiple tasks even in the presence of translation noise.
Related papers
- Comparison of Pre-trained Language Models for Turkish Address Parsing [0.0]
We focus on Turkish maps data and thoroughly evaluate both multilingual and Turkish-based BERT, DistilBERT, ELECTRA and RoBERTa.
We also propose a MultiLayer Perceptron (MLP) for fine-tuning BERT in addition to the standard approach of one-layer fine-tuning.
arXiv Detail & Related papers (2023-06-24T12:09:43Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z) - BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and
Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
arXiv Detail & Related papers (2022-06-21T18:34:11Z) - Pre-training Data Quality and Quantity for a Low-Resource Language: New
Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z) - Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for Multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z) - Reranking Machine Translation Hypotheses with Structured and Web-based
Language Models [11.363601836199331]
Two structured language models are applied for N-best rescoring.
We find that the combination of these language models increases the BLEU score by up to 1.6% absolute on blind test sets.
arXiv Detail & Related papers (2021-04-25T22:09:03Z) - WangchanBERTa: Pretraining transformer-based Thai Language Models [2.186960190193067]
We pretrain a language model based on RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total size)
We apply text processing rules that are specific to Thai most importantly preserving spaces.
We also experiment with word-level, syllable-level and SentencePiece tokenization with a smaller dataset to explore the effects on tokenization on downstream performance.
arXiv Detail & Related papers (2021-01-24T03:06:34Z) - Enhancing deep neural networks with morphological information [0.0]
We analyse the effect of adding morphological features to LSTM and BERT models.
Our results suggest that adding morphological features has mixed effects depending on the quality of features and the task.
arXiv Detail & Related papers (2020-11-24T22:35:44Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary$-$typically selected before training and permanently fixed later$-$affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts the Siamese network, learning sentence-level representations from natural language inference dataset and word/phrase-level representations from paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.