Efficiently Adapting Pretrained Language Models To New Languages
- URL: http://arxiv.org/abs/2311.05741v2
- Date: Thu, 14 Dec 2023 23:37:14 GMT
- Title: Efficiently Adapting Pretrained Language Models To New Languages
- Authors: Zoltan Csaki, Pian Pawakapan, Urmish Thakker, Qiantong Xu
- Abstract summary: Recent large language models (LLMs) exhibit sub-optimal performance on low-resource languages.
We study how to efficiently adapt any existing pretrained LLM to a new language without running into these issues.
- Score: 9.33333013114014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent large language models (LLMs) exhibit sub-optimal performance on
low-resource languages, as the training data of these models is usually
dominated by English and other high-resource languages. Furthermore, it is
challenging to train models for low-resource languages, especially from
scratch, due to a lack of high quality training data. Adapting pretrained LLMs
reduces the need for data in the new language while also providing cross-lingual
transfer capabilities. However, naively adapting to new languages leads
to catastrophic forgetting and poor tokenizer efficiency. In this work, we
study how to efficiently adapt any existing pretrained LLM to a new language
without running into these issues. In particular, we improve the encoding
efficiency of the tokenizer by adding new tokens from the target language and
study the data mixing recipe to mitigate forgetting. Our experiments on
adapting an English LLM to Hungarian and Thai show that our recipe can reach
better performance than open source models on the target language, with minimal
regressions on English.
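The recipe described in the abstract, extending the tokenizer with target-language tokens and replaying some English data during continued pretraining to limit forgetting, can be sketched with the Hugging Face transformers API. This is a minimal illustration, not the authors' implementation: the gpt2 base model, the placeholder token list, the 0.9 mixing ratio, and the helper name sample_example are assumptions of this sketch.

```python
# Hedged sketch: extend an English tokenizer with target-language tokens and
# mix replayed English data into continued pretraining on the new language.
# Base model, token list, and mixing ratio below are illustrative assumptions.
import random
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Frequent target-language subwords (e.g. mined from a Hungarian or Thai corpus).
new_tokens = ["szöveg", "nyelv", "ภาษา"]  # hypothetical placeholders
tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token ids get (randomly initialized) rows.
model.resize_token_embeddings(len(tokenizer))

def sample_example(target_corpus, english_corpus, target_ratio=0.9):
    """Mix target-language and replayed English examples; keeping some English
    data in the stream is one way to mitigate catastrophic forgetting."""
    pool = target_corpus if random.random() < target_ratio else english_corpus
    return random.choice(pool)
```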
Related papers
- Training Bilingual LMs with Data Constraints in the Targeted Language [20.262591969661447]
We study how to boost pretrained model performance in a data-constrained target language by enlisting data from an auxiliary language for which high-quality data is available.
We study this by quantifying the performance gap between training with data in a data-rich auxiliary language and training in the target language.
Our results show that, for closely related languages, stronger auxiliary datasets yield performance gains without any modification to the model or training objective.
arXiv Detail & Related papers (2024-11-20T02:27:40Z) - Exploring Pretraining via Active Forgetting for Improving Cross Lingual Transfer for Decoder Language Models [7.998168689120558]
Large Language Models (LLMs) demonstrate exceptional capabilities in a multitude of NLP tasks.
However, the efficacy of such models on languages other than English is often limited.
We show that LLMs pretrained with active forgetting are highly effective when adapting to new and unseen languages.
arXiv Detail & Related papers (2024-10-21T16:33:16Z) - Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data (a minimal merging sketch appears after this list).
arXiv Detail & Related papers (2024-07-04T15:14:17Z) - Improving Language Plasticity via Pretraining with Active Forgetting [63.36484652568976]
We propose to use an active forgetting mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages.
Experiments with RoBERTa show that models pretrained with our forgetting mechanism demonstrate faster convergence during language adaptation (a minimal forgetting sketch appears after this list).
arXiv Detail & Related papers (2023-07-03T17:12:44Z) - Low-Resource Cross-Lingual Adaptive Training for Nigerian Pidgin [3.2039731457723604]
We aim to improve upon both text classification and translation of Nigerian Pidgin (Naija) by collecting a large-scale parallel English-Pidgin corpus.
Our studies show that English pre-trained language models serve as a stronger prior than multilingual language models on English-Pidgin tasks, with improvements of up to 2.38 BLEU.
arXiv Detail & Related papers (2023-07-01T16:47:36Z) - Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation [21.057178077747754]
In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval.
By separating cross-lingual knowledge from query-document matching knowledge, OPTICAL only needs bitext data for distillation training.
Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages.
arXiv Detail & Related papers (2023-01-29T22:30:36Z) - BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting [50.24676567971536]
The BLOOM model is a large publicly available multilingual language model, but its pretraining was limited to 46 languages.
We apply existing language adaptation strategies to BLOOM and benchmark its zero-shot prompting performance on eight new languages.
We conclude that, with sufficient training data, language adaptation can generalize well to diverse languages.
arXiv Detail & Related papers (2022-12-19T15:24:45Z) - Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z) - Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The proposed approach replaces the shared vocabulary with a small language-specific vocabulary and fine-tunes the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade (a minimal embedding sketch appears after this list).
arXiv Detail & Related papers (2021-10-20T10:38:57Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to the models' limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
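For the model merging entry above, one common merging strategy is parameter-wise linear interpolation between two checkpoints that share an architecture. The referenced paper studies merging more broadly, so this is only a hedged sketch of the basic idea; the helper name merge_state_dicts, the model names in the usage note, and the alpha weight are assumptions.

```python
# Hedged sketch of simple model merging: a parameter-wise weighted average of
# two compatible checkpoints (e.g. a language-adapted model and a chat model).
import torch

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Linearly interpolate floating-point tensors; copy everything else as-is."""
    merged = {}
    for name, tensor_a in sd_a.items():
        tensor_b = sd_b[name]
        if tensor_a.is_floating_point():
            merged[name] = alpha * tensor_a + (1.0 - alpha) * tensor_b
        else:
            merged[name] = tensor_a.clone()
    return merged

# Usage (both models must have identical architectures):
# merged = merge_state_dicts(lang_model.state_dict(), chat_model.state_dict(), alpha=0.6)
# lang_model.load_state_dict(merged)
```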
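For the active forgetting entries above, the mechanism is usually described as periodically re-initializing the token embedding layer during pretraining so the transformer body does not overfit to a single embedding space. The helper name maybe_reset_embeddings, the reset interval, and the initialization below are illustrative assumptions, not the papers' exact settings.

```python
# Hedged sketch of active forgetting: re-initialize the input embeddings every
# `reset_every` optimizer steps during pretraining (works with Hugging Face-style
# models that expose get_input_embeddings()).
import torch.nn as nn

def maybe_reset_embeddings(model, step, reset_every=10_000):
    """Reset the token embedding matrix at a fixed interval of optimizer steps."""
    if step > 0 and step % reset_every == 0:
        embeddings = model.get_input_embeddings()  # an nn.Embedding module
        nn.init.normal_(embeddings.weight, mean=0.0, std=0.02)
```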
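For the language-specific embeddings entry above, the described idea of training only a new, small vocabulary's embeddings while freezing the original network can be sketched as follows. The helper name attach_language_embeddings and its arguments are assumptions, and handling of the output projection is omitted for brevity.

```python
# Hedged sketch: attach a fresh language-specific embedding table and train only
# it, leaving every pretrained parameter frozen so performance on the original
# languages is preserved.
import torch.nn as nn

def attach_language_embeddings(model, new_vocab_size, hidden_size):
    """Freeze the pretrained model and swap in trainable new input embeddings."""
    for param in model.parameters():
        param.requires_grad = False
    new_embeddings = nn.Embedding(new_vocab_size, hidden_size)  # trainable by default
    model.set_input_embeddings(new_embeddings)  # Hugging Face-style API
    return model
```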