Embedding structure matters: Comparing methods to adapt multilingual
vocabularies to new languages
- URL: http://arxiv.org/abs/2309.04679v2
- Date: Thu, 26 Oct 2023 21:26:16 GMT
- Title: Embedding structure matters: Comparing methods to adapt multilingual
vocabularies to new languages
- Authors: C.M. Downey, Terra Blevins, Nora Goldfine, Shane Steinert-Threlkeld
- Abstract summary: Pre-trained multilingual language models underpin a large portion of modern NLP tools outside of English.
We propose several simple techniques to replace a cross-lingual vocabulary with a compact, language-specific one.
- Score: 20.17308477850864
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Pre-trained multilingual language models underpin a large portion of modern
NLP tools outside of English. A strong baseline for specializing these models
for specific languages is Language-Adaptive Pre-Training (LAPT). However,
retaining a large cross-lingual vocabulary and embedding matrix comes at
considerable excess computational cost during adaptation. In this study, we
propose several simple techniques to replace a cross-lingual vocabulary with a
compact, language-specific one. Namely, we address strategies for
re-initializing the token embedding matrix after vocabulary specialization. We
then provide a systematic experimental comparison of our techniques, in
addition to the recently-proposed Focus method. We demonstrate that: 1)
Embedding-replacement techniques in the monolingual transfer literature are
inadequate for adapting multilingual models. 2) Replacing cross-lingual
vocabularies with smaller specialized ones provides an efficient method to
improve performance in low-resource languages. 3) Simple embedding
re-initialization techniques based on script-wise sub-distributions rival
techniques such as Focus, which rely on similarity scores obtained from an
auxiliary model.
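A minimal sketch of the script-wise re-initialization idea in point 3, under assumptions that are not taken from the paper's released code: each new token's embedding is sampled from a diagonal Gaussian fitted to the old embeddings of tokens in the same Unicode script, and `token_script` is a deliberately crude, illustrative script detector.

```python
# Sketch: re-initialize a compact, language-specific embedding matrix by
# sampling from per-script sub-distributions of the old multilingual embeddings.
# Assumptions (not the paper's code): `token_script` maps a token string to a
# script label, and each script's sub-distribution is a diagonal Gaussian.
import unicodedata
import torch


def token_script(token: str) -> str:
    """Crude script detector: first word of the Unicode name of the first alphabetic char."""
    for ch in token:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split(" ")[0]  # e.g. "LATIN", "CYRILLIC"
    return "OTHER"


def script_wise_reinit(old_emb: torch.Tensor, old_vocab: list[str],
                       new_vocab: list[str]) -> torch.Tensor:
    """Build an embedding matrix for `new_vocab` from script-wise statistics of `old_emb`."""
    d = old_emb.size(1)
    # Group old embeddings by script and compute mean / std per script.
    buckets: dict[str, list[int]] = {}
    for i, tok in enumerate(old_vocab):
        buckets.setdefault(token_script(tok), []).append(i)
    stats = {}
    for script, idx in buckets.items():
        vecs = old_emb[idx]
        std = vecs.std(0) if len(idx) > 1 else old_emb.std(0)
        stats[script] = (vecs.mean(0), std)
    global_stats = (old_emb.mean(0), old_emb.std(0))

    new_emb = torch.empty(len(new_vocab), d)
    for j, tok in enumerate(new_vocab):
        mean, std = stats.get(token_script(tok), global_stats)
        new_emb[j] = mean + std * torch.randn(d)  # sample from the script's sub-distribution
    return new_emb
```

In this sketch, new tokens whose script never appears in the old vocabulary fall back to the global embedding statistics.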
Related papers
- Adapters for Altering LLM Vocabularies: What Languages Benefit the Most? [23.83290627671739]
We propose a novel method for vocabulary adaptation using adapter modules that are trained to learn the optimal linear combination of existing embeddings.
VocADT offers a flexible and scalable solution without requiring external resources or language constraints.
We find that Latin-script languages and highly fragmented languages benefit the most from vocabulary adaptation.
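VocADT, as summarized above, learns a linear combination of existing embeddings through adapter modules; the sketch below is a rough, generic rendering of that idea, where the module name, the softmax normalization, and the dense mixing matrix over the full old vocabulary are illustrative assumptions rather than the paper's formulation.

```python
# Sketch: new-vocabulary embeddings as a learned (softmax-normalized) linear
# combination of the frozen original embeddings. Shapes and normalization are
# assumptions for illustration, not VocADT's exact method.
import torch
import torch.nn as nn


class VocabAdapter(nn.Module):
    def __init__(self, old_embedding: nn.Embedding, new_vocab_size: int):
        super().__init__()
        self.old_weight = old_embedding.weight.detach()  # (V_old, d), kept frozen
        # One row of mixing weights per new token over the old vocabulary.
        # (A real implementation would likely sparsify or factorize this matrix.)
        self.mixing = nn.Parameter(torch.zeros(new_vocab_size, self.old_weight.size(0)))

    def forward(self, new_token_ids: torch.Tensor) -> torch.Tensor:
        # Convex combination of old embeddings for each requested new token.
        weights = torch.softmax(self.mixing[new_token_ids], dim=-1)  # (..., V_old)
        return weights @ self.old_weight                             # (..., d)
```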
arXiv Detail & Related papers (2024-10-12T20:45:24Z)
- LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language [2.9914612342004503]
This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages.
We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension.
The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific fine-tuning generally enhances performance on downstream tasks.
arXiv Detail & Related papers (2024-05-13T13:41:59Z)
- OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining [49.213120730582354]
Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining.
We propose a novel framework, One For All (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively.
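OFA, like the FOCUS method discussed in the main paper, initializes unseen subword embeddings using an auxiliary source of similarity. The generic sketch below shows one common form of that idea, a similarity-weighted average over subwords shared by both vocabularies; the auxiliary vectors, temperature, and weighting scheme are assumptions, not OFA's exact procedure.

```python
# Sketch: initialize an unseen subword's embedding as a similarity-weighted
# average of the PLM embeddings of subwords present in both vocabularies.
# The auxiliary similarity space (`*_aux` vectors) is an assumption; OFA and
# FOCUS each define their own auxiliary space and weighting.
import torch


def init_unseen(unseen_aux: torch.Tensor,    # (d_aux,) auxiliary vector of the unseen subword
                overlap_aux: torch.Tensor,   # (K, d_aux) auxiliary vectors of shared subwords
                overlap_emb: torch.Tensor,   # (K, d_model) PLM embeddings of those subwords
                temperature: float = 0.1) -> torch.Tensor:
    sims = torch.cosine_similarity(unseen_aux.unsqueeze(0), overlap_aux, dim=-1)  # (K,)
    weights = torch.softmax(sims / temperature, dim=-1)
    return weights @ overlap_emb  # (d_model,) weighted average in the PLM's embedding space
```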
arXiv Detail & Related papers (2023-11-15T10:40:45Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation [129.99918589405675]
Large multilingual models trained with self-supervision achieve state-of-the-art results in a wide range of natural language processing tasks.
Multilingual fine-tuning improves performance on low-resource languages but requires modifying the entire model and can be prohibitively expensive.
We propose training language-family adapters on top of mBART-50 to facilitate cross-lingual transfer.
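Adapters of this kind are usually small bottleneck modules added to an otherwise frozen pretrained model; the sketch below is a generic bottleneck adapter, where the hidden sizes, activation, and placement are assumptions rather than the paper's exact configuration.

```python
# Sketch: a generic bottleneck adapter layer that could be inserted after a
# transformer sub-layer while the rest of the model stays frozen. Dimensions
# and placement are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection preserves the frozen model's representation
        # when the adapter's contribution is small at initialization.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```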
arXiv Detail & Related papers (2022-09-30T05:02:42Z)
- Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The proposed approach replaces the shared vocabulary with a small language-specific vocabulary and fine-tunes the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
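Concretely, the recipe summarized above amounts to freezing every original parameter and training only a freshly initialized, language-specific embedding matrix; a minimal sketch under that reading, where the model wiring and optimizer settings are placeholders rather than the paper's configuration:

```python
# Sketch: keep the original model frozen and train only a new language-specific
# embedding matrix. `model`, sizes, and optimizer settings are placeholders.
import torch
import torch.nn as nn


def prepare_for_new_language(model: nn.Module, new_vocab_size: int, d_model: int):
    """Freeze the original model and return fresh, trainable language-specific embeddings."""
    for p in model.parameters():
        p.requires_grad = False                      # original languages are preserved
    new_emb = nn.Embedding(new_vocab_size, d_model)  # replaces the shared vocabulary's embeddings
    # (In practice the model's input/output embedding modules would be swapped
    # for `new_emb`; that wiring is model-specific and omitted here.)
    optimizer = torch.optim.Adam(new_emb.parameters(), lr=1e-4)
    return new_emb, optimizer
```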
arXiv Detail & Related papers (2021-10-20T10:38:57Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank [46.626315158735615]
Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties.
This presents a challenge for language varieties unfamiliar to these models, whose labeled and unlabeled data is too limited to train a monolingual model effectively.
We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings.
arXiv Detail & Related papers (2020-09-29T16:12:52Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)