Phylogeny-Inspired Adaptation of Multilingual Models to New Languages
- URL: http://arxiv.org/abs/2205.09634v1
- Date: Thu, 19 May 2022 15:49:19 GMT
- Title: Phylogeny-Inspired Adaptation of Multilingual Models to New Languages
- Authors: Fahim Faisal, Antonios Anastasopoulos
- Abstract summary: We show how we can use language phylogenetic information to improve cross-lingual transfer leveraging closely related languages.
We perform adapter-based training on languages from diverse language families (Germanic, Uralic, Tupian, Uto-Aztecan) and evaluate on both syntactic and semantic tasks.
- Score: 43.62238334380897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large pretrained multilingual models, trained on dozens of languages, have
delivered promising results due to cross-lingual learning capabilities on
variety of language tasks. Further adapting these models to specific languages,
especially ones unseen during pre-training, is an important goal towards
expanding the coverage of language technologies. In this study, we show how we
can use language phylogenetic information to improve cross-lingual transfer
leveraging closely related languages in a structured, linguistically-informed
manner. We perform adapter-based training on languages from diverse language
families (Germanic, Uralic, Tupian, Uto-Aztecan) and evaluate on both syntactic
and semantic tasks, obtaining more than 20% relative performance improvements
over strong commonly used baselines, especially on languages unseen during
pre-training.
Related papers
- Multilingual Amnesia: On the Transferability of Unlearning in Multilingual LLMs [24.59074126514084]
We study multilingual unlearning using the Aya-Expanse 8B model under two settings: data unlearning and concept unlearning.<n>We extend benchmarks for factual knowledge and stereotypes to ten languages through translation.<n>Our experiments show that unlearning in high-resource languages is generally more stable, with asymmetric transfer effects observed between typologically related languages.
arXiv Detail & Related papers (2026-01-09T08:59:42Z) - LangGPS: Language Separability Guided Data Pre-Selection for Joint Multilingual Instruction Tuning [49.22807995935406]
Joint multilingual instruction tuning is a widely adopted approach to improve the multilingual instruction-following ability and downstream performance of large language models (LLMs)<n>Existing selection methods, often based on features like text quality, diversity, or task relevance, typically overlook the intrinsic linguistic structure of multilingual data.<n>We propose LangGPS, a lightweight two-stage pre-selection framework guided by language separability which quantifies how well samples in different languages can be distinguished in the model's representation space.
arXiv Detail & Related papers (2025-11-13T12:02:32Z) - Revisiting Multilingual Data Mixtures in Language Model Pretraining [20.282622416939997]
We study the impact of different multilingual data mixtures in pretraining large language models.<n>We find that combining English and multilingual data does not necessarily degrade the in-language performance of either group.<n>We do not observe a significant "curse of multilinguality" as the number of training languages increases.
arXiv Detail & Related papers (2025-10-29T20:46:03Z) - One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers [43.91164842493269]
We study what relatively cheap interventions early on in training improve "language plasticity"<n>We propose using a universal tokenizer that is trained for more languages than the primary pretraining languages.
arXiv Detail & Related papers (2025-06-12T14:47:13Z) - Targeted Multilingual Adaptation for Low-resource Language Families [17.212424929235624]
We study best practices for adapting a pre-trained model to a language family.
Our adapted models significantly outperform mono- and multilingual baselines.
Low-resource languages can be aggressively up-sampled during training at little detriment to performance in high-resource languages.
arXiv Detail & Related papers (2024-05-20T23:38:06Z) - LlamaTurk: Adapting Open-Source Generative Large Language Models for Low-Resource Language [2.9914612342004503]
This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages.
We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension.
The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance of downstream tasks.
arXiv Detail & Related papers (2024-05-13T13:41:59Z) - Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment [42.624862172666624]
We propose a simple yet effective cross-lingual alignment framework exploiting pairs of translation sentences.
It aligns the internal sentence representations across different languages via multilingual contrastive learning.
Experimental results show that even with less than 0.1 textperthousand of pre-training tokens, our alignment framework significantly boosts the cross-lingual abilities of generative language models.
arXiv Detail & Related papers (2023-11-14T11:24:08Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Adapting Multilingual Speech Representation Model for a New,
Underresourced Language through Multilingual Fine-tuning and Continued
Pretraining [2.3513645401551333]
We investigate the possibility for adapting an existing multilingual wav2vec 2.0 model for a new language.
Our results show that continued pretraining is the most effective method to adapt a wav2vec 2.0 model for a new language.
We find that if a model pretrained on a related speech variety or an unrelated language with similar phonological characteristics is available, multilingual fine-tuning using additional data from that language can have positive impact on speech recognition performance.
arXiv Detail & Related papers (2023-01-18T03:57:53Z) - Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of
Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z) - Cross-lingual Lifelong Learning [53.06904052325966]
We present a principled Cross-lingual Continual Learning (CCL) evaluation paradigm.
We provide insights into what makes multilingual sequential learning particularly challenging.
The implications of this analysis include a recipe for how to measure and balance different cross-lingual continual learning desiderata.
arXiv Detail & Related papers (2022-05-23T09:25:43Z) - Cross-Lingual Language Model Meta-Pretraining [21.591492094502424]
We propose a cross-lingual language model meta-pretraining, which learns the two abilities in different training phases.
Our method improves both generalization and cross-lingual transfer, and produces better-aligned representations across different languages.
arXiv Detail & Related papers (2021-09-23T03:47:44Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.