Efficiently Adapting Pretrained Language Models To New Languages
- URL: http://arxiv.org/abs/2311.05741v2
- Date: Thu, 14 Dec 2023 23:37:14 GMT
- Title: Efficiently Adapting Pretrained Language Models To New Languages
- Authors: Zoltan Csaki, Pian Pawakapan, Urmish Thakker, Qiantong Xu
- Abstract summary: Recent large language models (LLMs) exhibit sub-optimal performance on low-resource languages.
We study how to efficiently adapt any existing pretrained LLM to a new language without running into these issues.
- Score: 9.33333013114014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent large language models (LLMs) exhibit sub-optimal performance on
low-resource languages, as the training data of these models is usually
dominated by English and other high-resource languages. Furthermore, it is
challenging to train models for low-resource languages, especially from
scratch, due to a lack of high quality training data. Adapting pretrained LLMs
reduces the need for data in the new language while also providing
cross-lingual transfer capabilities. However, naively adapting to new languages leads
to catastrophic forgetting and poor tokenizer efficiency. In this work, we
study how to efficiently adapt any existing pretrained LLM to a new language
without running into these issues. In particular, we improve the encoding
efficiency of the tokenizer by adding new tokens from the target language and
study the data mixing recipe to mitigate forgetting. Our experiments on
adapting an English LLM to Hungarian and Thai show that our recipe can reach
better performance than open source models on the target language, with minimal
regressions on English.
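As a concrete illustration of the two ingredients described above (tokenizer vocabulary extension and data mixing), here is a minimal sketch using the Hugging Face transformers API. The base model name, the example Hungarian tokens, and the 75/25 mixing ratio are placeholders chosen for illustration, not the settings reported in the paper.
```python
# Minimal sketch (not the paper's exact recipe): extend an English LLM's
# tokenizer with target-language tokens, then build a mixed training stream.
import random

from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "gpt2"  # placeholder; any pretrained English causal LM works the same way

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# 1) Tokenizer extension: add frequent target-language tokens (illustrative
#    Hungarian examples) so target-language text encodes to fewer tokens.
new_tokens = ["szeretet", "fejlesztés", "kormány"]
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # rows for new tokens are freshly initialized
print(f"added {num_added} tokens; new vocab size: {len(tokenizer)}")

# 2) Data mixing: sample a continued-pretraining stream in which roughly
#    `target_fraction` of documents are in the new language and the rest are
#    English, so the model keeps seeing English and forgets less.
def mix_corpora(target_docs, english_docs, target_fraction=0.75, total=10_000, seed=0):
    rng = random.Random(seed)
    return [
        rng.choice(target_docs) if rng.random() < target_fraction else rng.choice(english_docs)
        for _ in range(total)
    ]
```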
Related papers
- Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.
For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.
We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z) - Extending LLMs to New Languages: A Case Study of Llama and Persian Adaptation [36.92567530333872]
We study adding a new language, i.e. Persian, to a large language model (LLM).
We employ a multi-stage approach involving pretraining on monolingual Persian data.
We evaluate the model's performance at each stage on generation and classification tasks.
arXiv Detail & Related papers (2024-12-17T23:18:06Z) - How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario [72.02391485962127]
Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR).
In low-resource language ASR, they encounter a domain mismatch between the pretraining languages and the low-resource target language.
We extend a conventional efficient fine-tuning scheme based on adapters to handle these issues.
arXiv Detail & Related papers (2024-11-27T10:51:00Z) - Training Bilingual LMs with Data Constraints in the Targeted Language [17.623676545426477]
We study how to boost pretrained model performance in a target language with insufficient pretraining data.
We study this by quantifying the performance gap between training on data in a data-rich auxiliary language and training on data in the target language.
arXiv Detail & Related papers (2024-11-20T02:27:40Z) - Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training (a minimal weight-averaging sketch appears after this list).
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z) - Improving Language Plasticity via Pretraining with Active Forgetting [63.36484652568976]
We propose to use an active forgetting mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages.
Experiments with RoBERTa show that models pretrained with our forgetting mechanism demonstrate faster convergence during language adaptation.
arXiv Detail & Related papers (2023-07-03T17:12:44Z) - Low-Resource Cross-Lingual Adaptive Training for Nigerian Pidgin [3.2039731457723604]
We aim to improve upon both text classification and translation of Nigerian Pidgin (Naija) by collecting a large-scale parallel English-Pidgin corpus.
Our studies show that English pre-trained language models serve as a stronger prior than multilingual language models on English-Pidgin tasks, with improvements of up to 2.38 BLEU.
arXiv Detail & Related papers (2023-07-01T16:47:36Z) - Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The proposed method replaces the shared vocabulary with a small language-specific vocabulary and fine-tunes the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
arXiv Detail & Related papers (2021-10-20T10:38:57Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
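For the model-merging entry above ("Unlocking the Potential of Model Merging for Low-Resource Languages"), the following sketch shows the simplest form of the idea: linearly interpolating the weights of two same-architecture checkpoints without any further training. The checkpoint paths and the interpolation coefficient are placeholders, and plain weight averaging is only one of several merging techniques; the paper's own method may differ.
```python
# Minimal sketch of model merging by linear weight interpolation. Assumes both
# checkpoints share the same architecture and parameter names (e.g. two
# Llama-2-7B derivatives); paths and ALPHA are placeholders.
import torch
from transformers import AutoModelForCausalLM

TASK_CKPT = "path/to/instruction-tuned-model"  # placeholder: model with task-solving skills
LANG_CKPT = "path/to/target-language-model"    # placeholder: model continually pretrained on the new language
ALPHA = 0.5                                    # weight given to the task model

model_task = AutoModelForCausalLM.from_pretrained(TASK_CKPT, torch_dtype=torch.float32)
model_lang = AutoModelForCausalLM.from_pretrained(LANG_CKPT, torch_dtype=torch.float32)

lang_state = model_lang.state_dict()
merged_state = {
    # Interpolate every parameter tensor; shapes must match across both models.
    name: ALPHA * weight + (1.0 - ALPHA) * lang_state[name]
    for name, weight in model_task.state_dict().items()
}

model_task.load_state_dict(merged_state)
model_task.save_pretrained("merged-model")  # usable immediately, no additional training
```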