Related papers: Targeted Multilingual Adaptation for Low-resource Language Families

Targeted Multilingual Adaptation for Low-resource Language Families

URL: http://arxiv.org/abs/2405.12413v1
Date: Mon, 20 May 2024 23:38:06 GMT
Title: Targeted Multilingual Adaptation for Low-resource Language Families
Authors: C. M. Downey, Terra Blevins, Dhwani Serai, Dwija Parikh, Shane Steinert-Threlkeld,
Abstract summary: We study best practices for adapting a pre-trained model to a language family. Our adapted models significantly outperform mono- and multilingual baselines. Low-resource languages can be aggressively up-sampled during training at little detriment to performance in high-resource languages.
Score: 17.212424929235624
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The "massively-multilingual" training of multilingual models is known to limit their utility in any one language, and they perform particularly poorly on low-resource languages. However, there is evidence that low-resource languages can benefit from targeted multilinguality, where the model is trained on closely related languages. To test this approach more rigorously, we systematically study best practices for adapting a pre-trained model to a language family. Focusing on the Uralic family as a test case, we adapt XLM-R under various configurations to model 15 languages; we then evaluate the performance of each experimental setting on two downstream tasks and 11 evaluation languages. Our adapted models significantly outperform mono- and multilingual baselines. Furthermore, a regression analysis of hyperparameter effects reveals that adapted vocabulary size is relatively unimportant for low-resource languages, and that low-resource languages can be aggressively up-sampled during training at little detriment to performance in high-resource languages. These results introduce new best practices for performing language adaptation in a targeted setting.

Related papers

Revisiting Multilingual Data Mixtures in Language Model Pretraining [20.282622416939997]
We study the impact of different multilingual data mixtures in pretraining large language models.<n>We find that combining English and multilingual data does not necessarily degrade the in-language performance of either group.<n>We do not observe a significant "curse of multilinguality" as the number of training languages increases.
arXiv Detail & Related papers (2025-10-29T20:46:03Z)
Investigating Language-Specific Calibration For Pruning Multilingual Large Language Models [11.421452042888523]
We compare different calibration languages for pruning multilingual models across diverse languages, tasks, models, and SotA pruning techniques. Our results offer practical suggestions, for example, calibrating in the target language can efficiently retain the language modeling capability but does not necessarily benefit downstream tasks.
arXiv Detail & Related papers (2024-08-26T16:29:13Z)
LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language [2.9914612342004503]
This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance of downstream tasks.
arXiv Detail & Related papers (2024-05-13T13:41:59Z)
Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets. We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages [25.52470575274251]
We pre-train over 10,000 monolingual and multilingual language models for over 250 languages. We find that in moderation, adding multilingual data improves low-resource language modeling performance. As dataset sizes increase, adding multilingual data begins to hurt performance for both low-resource and high-resource languages.
arXiv Detail & Related papers (2023-11-15T18:47:42Z)
Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally. Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation [129.99918589405675]
Large multilingual models trained with self-supervision achieve state-of-the-art results in a wide range of natural language processing tasks. Multilingual fine-tuning improves performance on low-resource languages but requires modifying the entire model and can be prohibitively expensive. We propose training language-family adapters on top of mBART-50 to facilitate cross-lingual transfer.
arXiv Detail & Related papers (2022-09-30T05:02:42Z)
High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose the multilingual translation model with the high-resource language-specific training (HLT-MT) to alleviate the negative interference. Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder. HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
Phylogeny-Inspired Adaptation of Multilingual Models to New Languages [43.62238334380897]
We show how we can use language phylogenetic information to improve cross-lingual transfer leveraging closely related languages. We perform adapter-based training on languages from diverse language families (Germanic, Uralic, Tupian, Uto-Aztecan) and evaluate on both syntactic and semantic tasks.
arXiv Detail & Related papers (2022-05-19T15:49:19Z)
Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks. For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z)
Probing Multilingual Language Models for Discourse [0.0]
We find that the XLM-RoBERTa family of models consistently show the best performance. Our results also indicate that model distillation may hurt the ability of cross-lingual transfer of sentence representations.
arXiv Detail & Related papers (2021-06-09T06:34:21Z)
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages. We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological taggings for high-resource languages and low-resource languages together. Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.