Transferring Monolingual Model to Low-Resource Language: The Case of
Tigrinya
- URL: http://arxiv.org/abs/2006.07698v2
- Date: Fri, 19 Jun 2020 15:00:02 GMT
- Title: Transferring Monolingual Model to Low-Resource Language: The Case of
Tigrinya
- Authors: Abrhalei Tela, Abraham Woubie, and Ville Hautamaki
- Abstract summary: We propose a cost-effective transfer learning method that adapts a strong source-language model to a low-resource language.
With only 10k examples of the new Tigrinya sentiment analysis dataset, English XLNet achieves a 78.88% F1-score.
Fine-tuning the (English) XLNet model on the CLS dataset yields promising results compared to mBERT, even outperforming mBERT on one Japanese-language dataset.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In recent years, transformer models have achieved great success in natural
language processing (NLP) tasks. Most of the current state-of-the-art NLP
results are achieved by using monolingual transformer models, where the model
is pre-trained using a single language unlabelled text corpus. Then, the model
is fine-tuned to the specific downstream task. However, the cost of
pre-training a new transformer model is high for most languages. In this work,
we propose a cost-effective transfer learning method that adapts a strong
source-language model, pre-trained on a large monolingual corpus, to a
low-resource language. Using the XLNet language model, we demonstrate
performance competitive with mBERT and with a pre-trained target-language
model on the cross-lingual sentiment (CLS) dataset and on a new sentiment
analysis dataset for the low-resource language Tigrinya. With only 10k
examples of this Tigrinya sentiment analysis dataset, English XLNet achieves
a 78.88% F1-score, outperforming BERT and mBERT by 10% and 7%, respectively.
More interestingly, fine-tuning the (English) XLNet model on the CLS dataset
yields promising results compared to mBERT and even outperforms mBERT on one
of the Japanese-language datasets.
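As a concrete illustration of this transfer setup, the following is a minimal
sketch of fine-tuning an English XLNet checkpoint on a small labelled
sentiment dataset with the Hugging Face transformers and datasets libraries.
The file name tigrinya_sentiment.csv, its text/label columns, and the
hyperparameters are illustrative assumptions, not details taken from the paper.

# Minimal sketch (assumed setup): fine-tune an English XLNet checkpoint on a
# small Tigrinya sentiment dataset. File name, column names, and
# hyperparameters are placeholders, not the paper's exact configuration.
from datasets import load_dataset
from transformers import (Trainer, TrainingArguments,
                          XLNetForSequenceClassification, XLNetTokenizerFast)

# Hypothetical CSV with "text" and "label" columns (0 = negative, 1 = positive).
data = load_dataset("csv", data_files="tigrinya_sentiment.csv")["train"]
data = data.train_test_split(test_size=0.1)

tokenizer = XLNetTokenizerFast.from_pretrained("xlnet-base-cased")
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased",
                                                       num_labels=2)

def tokenize(batch):
    # Tigrinya text is segmented with the English SentencePiece vocabulary;
    # characters unseen during pre-training mostly map to unknown pieces.
    return tokenizer(batch["text"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="xlnet-tigrinya-sentiment",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=2e-5)

Trainer(model=model, args=args,
        train_dataset=data["train"], eval_dataset=data["test"],
        tokenizer=tokenizer).train()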
Related papers
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
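As a rough illustration of the merging idea (and not the specific merging
method evaluated in that paper), the sketch below averages the parameters of
two same-architecture checkpoints in PyTorch; the model names are placeholders.

# Naive model-merging sketch: element-wise weighted average of two checkpoints
# that share one architecture. Model names are placeholders; real merging
# recipes are usually more selective than plain averaging.
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("base-llm")               # placeholder
expert = AutoModelForCausalLM.from_pretrained("target-language-llm")  # placeholder

alpha = 0.5  # interpolation weight given to the target-language model
expert_state = expert.state_dict()

merged_state = {}
for name, param in base.state_dict().items():
    if param.is_floating_point():
        merged_state[name] = (1 - alpha) * param + alpha * expert_state[name]
    else:
        merged_state[name] = param  # leave integer buffers untouched

base.load_state_dict(merged_state)   # 'base' now carries the merged weights
base.save_pretrained("merged-model")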
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-05-21T06:44:59Z)
OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z)
Learning Compact Metrics for MT [21.408684470261342]
We investigate the trade-off between multilinguality and model capacity with RemBERT, a state-of-the-art multilingual language model.
We show that model size is indeed a bottleneck for cross-lingual transfer, then demonstrate how distillation can help address this bottleneck.
Our method yields up to 10.5% improvement over vanilla fine-tuning and reaches 92.6% of RemBERT's performance using only a third of its parameters.
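For context, the distillation step trains a smaller student to reproduce a
larger teacher's outputs. The snippet below is a generic, classification-style
distillation loss in PyTorch, shown only to illustrate the mechanism; it is
not the regression-metric training recipe used in that paper.

# Generic knowledge-distillation loss: the student matches the teacher's
# temperature-softened distribution as well as the gold labels. Temperature
# and mixing weight are illustrative values.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-scaled distributions.
    soft_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, log_target=True,
                  reduction="batchmean") * temperature ** 2
    # Hard-target term: ordinary cross-entropy against the gold labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce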
arXiv Detail & Related papers (2021-10-12T20:39:35Z)
Adapting Monolingual Models: Data can be Scarce when Language Similarity is High [3.249853429482705]
We investigate the performance of zero-shot transfer learning with as little data as possible.
We retrain the lexical layers of four BERT-based models using data from two low-resource target language varieties.
With high language similarity, 10MB of data appears sufficient to achieve substantial monolingual transfer performance.
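Retraining the "lexical layers" amounts to updating only the (sub)word
embedding matrix while the rest of the transformer stays frozen. The sketch
below shows that freezing pattern with Hugging Face transformers; the model
name is a placeholder and the masked-language-model training loop is omitted.

# Sketch of lexical-layer adaptation: freeze the transformer body and leave
# only the input embeddings trainable. Model name is a placeholder, not one
# of the four models used in the paper.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")  # placeholder

for param in model.parameters():             # freeze everything ...
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True                # ... except the lexical layer

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters (embeddings only): {trainable:,}")
# From here, run ordinary masked-language-model training on the small
# target-variety corpus; only the embedding weights receive updates.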
arXiv Detail & Related papers (2021-05-06T17:43:40Z)
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Because of their limited capacity and the large differences in pretraining data, however, these models show a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
Model Selection for Cross-Lingual Transfer [15.197350103781739]
We propose a machine learning approach to model selection that uses the fine-tuned model's own internal representations to predict its cross-lingual capabilities.
In extensive experiments, we find that this method consistently selects better models than selection based on English validation data, across twenty-five languages.
arXiv Detail & Related papers (2020-10-13T02:36:48Z)
ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It achieves state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores in all datasets, including existing ones as well as composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.