Role of Language Relatedness in Multilingual Fine-tuning of Language
Models: A Case Study in Indo-Aryan Languages
- URL: http://arxiv.org/abs/2109.10534v1
- Date: Wed, 22 Sep 2021 06:37:39 GMT
- Title: Role of Language Relatedness in Multilingual Fine-tuning of Language
Models: A Case Study in Indo-Aryan Languages
- Authors: Tejas Indulal Dhamecha, Rudra Murthy V, Samarth Bharadwaj, Karthik
Sankaranarayanan, Pushpak Bhattacharyya
- Abstract summary: We explore the impact of leveraging the relatedness of languages that belong to the same family in NLP models using multilingual fine-tuning.
Low-resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning.
- Score: 34.79533646549939
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We explore the impact of leveraging the relatedness of languages that belong
to the same family in NLP models using multilingual fine-tuning. We hypothesize
and validate that multilingual fine-tuning of pre-trained language models can
yield better performance on downstream NLP applications, compared to models
fine-tuned on individual languages. A first-of-its-kind detailed study is presented to track performance changes as languages are added to a base language in a graded and greedy (in the sense of the best boost in performance) manner, which reveals that careful selection of a subset of related languages can yield significantly better performance than utilizing all related languages. The
Indo-Aryan (IA) language family is chosen for the study, the exact languages
being Bengali, Gujarati, Hindi, Marathi, Oriya, Punjabi and Urdu. The script
barrier is crossed by simple rule-based transliteration of the text of all
languages to Devanagari. Experiments are performed on mBERT, IndicBERT, MuRIL
and two RoBERTa-based LMs, the last two being pre-trained by us. Low-resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning. The Textual Entailment, Entity Classification, and Section Title Prediction tasks of IndicGLUE, together with POS tagging, form our test bed. Compared to monolingual fine-tuning, we obtain relative performance improvements of up to 150% on the downstream tasks. The surprising take-away is that for each language there is a particular combination of other languages that yields the best performance, and adding any further language is in fact detrimental.
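A minimal sketch of the graded, greedy language-addition protocol described above, assuming a hypothetical `finetune_and_evaluate` helper that fine-tunes a model on a given set of languages and returns a downstream dev score; the helper name and stopping rule are illustrative, not taken from the paper.

```python
# Illustrative sketch (not the authors' released code) of graded, greedy language
# addition: start from a base language, repeatedly add the related language that
# gives the best boost, and stop once no addition improves the score.

def greedy_language_selection(base_lang, candidate_langs, finetune_and_evaluate):
    """`finetune_and_evaluate(langs)` is an assumed helper that fine-tunes a
    pre-trained LM on the union of `langs` and returns a dev-set score."""
    selected = [base_lang]
    best_score = finetune_and_evaluate(selected)
    remaining = set(candidate_langs) - {base_lang}
    while remaining:
        # Score every candidate extension of the current language set.
        scores = {lang: finetune_and_evaluate(selected + [lang]) for lang in remaining}
        best_lang = max(scores, key=scores.get)
        if scores[best_lang] <= best_score:
            break  # any further language would be detrimental
        selected.append(best_lang)
        best_score = scores[best_lang]
        remaining.remove(best_lang)
    return selected, best_score
```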
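To illustrate the script-unification step, here is a minimal sketch of simple rule-based conversion to Devanagari, relying on the fact that Brahmi-derived Indic scripts share a largely parallel ISCII-style layout in Unicode, so most characters can be mapped by a fixed code-point offset. This is only an approximation and not the authors' exact rules: a few characters do not align across scripts, and Urdu's Perso-Arabic script would need a separate rule set; the function and script names are illustrative.

```python
# Approximate rule-based transliteration to Devanagari by Unicode block offset.
# Not the paper's exact rule set: some characters differ across scripts, and
# Urdu (Perso-Arabic) requires separate handling.

DEVANAGARI_BASE = 0x0900
SCRIPT_BASES = {
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,   # used for Punjabi
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
}

def to_devanagari(text: str, script: str) -> str:
    base = SCRIPT_BASES[script]
    converted = []
    for ch in text:
        cp = ord(ch)
        if base <= cp < base + 0x80:
            # Map to the character at the same offset in the Devanagari block.
            converted.append(chr(DEVANAGARI_BASE + (cp - base)))
        else:
            converted.append(ch)  # leave digits, punctuation, spaces untouched
    return "".join(converted)

# Example: Oriya text re-expressed with Devanagari code points.
print(to_devanagari("ଓଡ଼ିଆ", "oriya"))
```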
Related papers
- Linguistically-Informed Multilingual Instruction Tuning: Is There an Optimal Set of Languages to Tune? [0.0]
This study proposes a method to select languages for instruction tuning in a linguistically informed way.
We use a simple algorithm to choose diverse languages and test their effectiveness on various benchmarks and open-ended questions.
Our results show that this careful selection generally leads to better outcomes than choosing languages at random.
arXiv Detail & Related papers (2024-10-10T10:57:24Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments [57.273662221547056]
In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance.
We observe that the existence of a predominant language during training boosts the performance of less frequent languages.
As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet it remains inconclusive whether language imbalance causes cross-lingual generalisation in that setting.
arXiv Detail & Related papers (2024-04-11T17:58:05Z)
- GradSim: Gradient-Based Language Grouping for Effective Multilingual Training [13.730907708289331]
We propose GradSim, a language grouping method based on gradient similarity.
Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains.
Besides linguistic features, the topics of the datasets play an important role in language grouping.
arXiv Detail & Related papers (2023-10-23T18:13:37Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models [23.62852626011989]
We show that grammatical structures in higher-resource languages bleed into lower-resource languages.
We show this bias via a novel method for comparing the fluency of multilingual models to the fluency of monolingual Spanish and Greek models.
arXiv Detail & Related papers (2022-10-11T17:06:38Z)
- Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation [129.99918589405675]
Large multilingual models trained with self-supervision achieve state-of-the-art results in a wide range of natural language processing tasks.
Multilingual fine-tuning improves performance on low-resource languages but requires modifying the entire model and can be prohibitively expensive.
We propose training language-family adapters on top of mBART-50 to facilitate cross-lingual transfer.
arXiv Detail & Related papers (2022-09-30T05:02:42Z)
- Multilingual Text Classification for Dravidian Languages [4.264592074410622]
We propose a multilingual text classification framework for the Dravidian languages.
The framework uses the pre-trained LaBSE model as its base. To address the problem that the model cannot adequately recognize and utilize the correlations among languages, we further propose a language-specific representation module.
arXiv Detail & Related papers (2021-12-03T04:26:49Z)
- Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models.
arXiv Detail & Related papers (2021-10-26T14:59:16Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.