Exploiting Language Relatedness for Low Web-Resource Language Model
Adaptation: An Indic Languages Study
- URL: http://arxiv.org/abs/2106.03958v2
- Date: Wed, 9 Jun 2021 10:16:47 GMT
- Title: Exploiting Language Relatedness for Low Web-Resource Language Model
Adaptation: An Indic Languages Study
- Authors: Yash Khemchandani, Sarvesh Mehtani, Vaidehi Patil, Abhijeet Awasthi,
Partha Talukdar, Sunita Sarawagi
- Abstract summary: We argue that relatedness among languages in a language family may be exploited to overcome some of the corpora limitations of LRLs.
We focus on Indian languages, and exploit relatedness along two dimensions: (1) script (since many Indic scripts originated from the Brahmic script) and (2) sentence structure.
- Score: 14.34516262614775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research in multilingual language models (LM) has demonstrated their
ability to effectively handle multiple languages in a single model. This holds
promise for low web-resource languages (LRL) as multilingual models can enable
transfer of supervision from high resource languages to LRLs. However,
incorporating a new language in an LM still remains a challenge, particularly
for languages with limited corpora and in unseen scripts. In this paper we
argue that relatedness among languages in a language family may be exploited to
overcome some of the corpora limitations of LRLs, and propose RelateLM. We
focus on Indian languages, and exploit relatedness along two dimensions: (1)
script (since many Indic scripts originated from the Brahmic script), and (2)
sentence structure. RelateLM uses transliteration to convert the unseen script
of limited LRL text into the script of a Related Prominent Language (RPL)
(Hindi in our case). While exploiting similar sentence structures, RelateLM
utilizes readily available bilingual dictionaries to pseudo translate RPL text
into LRL corpora. Experiments on multiple real-world benchmark datasets provide
validation to our hypothesis that using a related language as pivot, along with
transliteration and pseudo translation based data augmentation, can be an
effective way to adapt LMs for LRLs, rather than direct training or pivoting
through English.
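As a rough sketch of the two augmentation steps the abstract describes, the snippet below (illustrative only, not the authors' released code) exploits the fact that the major Brahmic scripts occupy parallel, ISCII-derived Unicode blocks, so a simple codepoint-offset mapping approximates transliteration of LRL text into Devanagari; a dedicated transliteration tool would handle script-specific exceptions. The pseudo-translation step is shown as plain word-by-word dictionary substitution; the function names, block table, and toy dictionary are assumptions made for the example.

```python
# Minimal sketch of RelateLM-style data augmentation, under the assumptions
# stated above. Helper names and the toy dictionary are hypothetical.

# Major Brahmic scripts occupy parallel 128-codepoint Unicode blocks with an
# ISCII-derived layout, so an offset mapping approximates transliteration.
BRAHMIC_BLOCKS = {
    "devanagari": 0x0900,
    "bengali":    0x0980,
    "gurmukhi":   0x0A00,
    "gujarati":   0x0A80,
    "oriya":      0x0B00,
}

def transliterate_to_devanagari(text: str, source: str) -> str:
    """Map LRL text in an unseen Brahmic script onto the RPL (Hindi) script."""
    src_base, tgt_base = BRAHMIC_BLOCKS[source], BRAHMIC_BLOCKS["devanagari"]
    out = []
    for ch in text:
        cp = ord(ch)
        if src_base <= cp < src_base + 0x80:
            out.append(chr(cp - src_base + tgt_base))
        else:
            out.append(ch)  # digits, punctuation, Latin, etc. pass through
    return "".join(out)

def pseudo_translate(rpl_sentence: str, rpl_to_lrl: dict) -> str:
    """Word-by-word substitution of RPL tokens via a bilingual dictionary.

    Tokens missing from the dictionary are kept as-is, so the output retains
    the RPL sentence structure (similar to the LRL's) with LRL vocabulary
    substituted wherever the lexicon allows.
    """
    return " ".join(rpl_to_lrl.get(tok, tok) for tok in rpl_sentence.split())

if __name__ == "__main__":
    # Oriya word for "language" mapped onto Devanagari codepoints.
    print(transliterate_to_devanagari("ଭାଷା", "oriya"))  # -> भाषा
    # Toy dictionary entry: Hindi "paani" (water) -> transliterated Bengali "jol".
    toy_dict = {"पानी": transliterate_to_devanagari("জল", "bengali")}
    print(pseudo_translate("पानी ठंडा है", toy_dict))  # -> जल ठंडा है
```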
Related papers
- Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z)
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Cross-Lingual Transfer Robustness to Lower-Resource Languages on Adversarial Datasets [4.653113033432781]
This work investigates the cross-lingual transfer capabilities of Multilingual Language Models (MLLMs).
Our research provides valuable insights into cross-lingual transfer and its implications for NLP applications.
arXiv Detail & Related papers (2024-03-29T08:47:15Z)
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
Their performance in most languages still lags behind that in a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
- Does Transliteration Help Multilingual Language Modeling? [0.0]
We empirically measure the effect of transliteration on Multilingual Language Models.
We focus on the Indic languages, which have the highest script diversity in the world.
We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages.
arXiv Detail & Related papers (2022-01-29T05:48:42Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)