Does Transliteration Help Multilingual Language Modeling?
- URL: http://arxiv.org/abs/2201.12501v3
- Date: Sun, 30 Jul 2023 22:10:44 GMT
- Title: Does Transliteration Help Multilingual Language Modeling?
- Authors: Ibraheem Muhammad Moosa, Mahmud Elahi Akhter, Ashfia Binte Habib
- Abstract summary: We empirically measure the effect of transliteration on Multilingual Language Models.
We focus on the Indic languages, which have the highest script diversity in the world.
We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Script diversity presents a challenge to Multilingual Language Models (MLLM)
by reducing lexical overlap among closely related languages. Therefore,
transliterating closely related languages that use different writing scripts to
a common script may improve the downstream task performance of MLLMs. We
empirically measure the effect of transliteration on MLLMs in this context. We
specifically focus on the Indic languages, which have the highest script
diversity in the world, and we evaluate our models on the IndicGLUE benchmark.
We perform the Mann-Whitney U test to rigorously verify whether the effect of
transliteration is significant or not. We find that transliteration benefits
the low-resource languages without negatively affecting the comparatively
high-resource languages. We also measure the cross-lingual representation
similarity of the models using centered kernel alignment on parallel sentences
from the FLORES-101 dataset. We find that for parallel sentences across
different languages, the transliteration-based model learns sentence
representations that are more similar.
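The two analysis tools named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' code. It assumes the linear variant of centered kernel alignment (the abstract does not say which kernel was used), and it computes only the Mann-Whitney U statistic, not a p-value. The function names `mann_whitney_u` and `linear_cka` are hypothetical, and the inputs in the usage note are toy values, not results from the paper.

```python
from itertools import product


def mann_whitney_u(xs, ys):
    """U statistic of sample xs against ys.

    Counts the pairs (x, y) with x > y; a tie contributes 0.5.
    A significance test would additionally need the null
    distribution of U, which this sketch does not compute.
    """
    u = 0.0
    for x, y in product(xs, ys):
        if x > y:
            u += 1.0
        elif x == y:
            u += 0.5
    return u


def _center_columns(m):
    # Subtract each feature's (column's) mean over the examples (rows).
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def _cross_frob_sq(a, b):
    # Squared Frobenius norm of a^T b, summed over pairs of feature columns.
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(p * q for p, q in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two representation matrices.

    x and y are lists of rows, one row per example, with the same
    number of rows (e.g. embeddings of parallel sentences).
    Assumption: linear kernel, i.e.
    ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features.
    """
    x, y = _center_columns(x), _center_columns(y)
    denom = (_cross_frob_sq(x, x) ** 0.5) * (_cross_frob_sq(y, y) ** 0.5)
    return _cross_frob_sq(x, y) / denom
```

For example, `mann_whitney_u(scores_a, scores_b)` could compare per-seed task scores of two models, and `linear_cka(emb_lang1, emb_lang2)` could compare sentence embeddings of the same parallel sentences in two languages. Values of linear CKA closer to 1 indicate more similar representations.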
Related papers
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Improving Multi-lingual Alignment Through Soft Contrastive Learning [9.454626745893798]
We propose a novel method to align multi-lingual embeddings based on the similarity of sentences measured by a pre-trained mono-lingual embedding model.
Given translation sentence pairs, we train a multi-lingual model in a way that the similarity between cross-lingual embeddings follows the similarity of sentences measured at the mono-lingual teacher model.
arXiv Detail & Related papers (2024-05-25T09:46:07Z)
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- Could We Have Had Better Multilingual LLMs If English Was Not the Central Language? [4.655168524016426]
Large Language Models (LLMs) demonstrate strong machine translation capabilities on languages they are trained on.
Our study delves into Llama2's translation capabilities.
Our experiments show that the 7B Llama2 model yields above 10 BLEU when translating into all languages it has seen.
arXiv Detail & Related papers (2024-02-21T16:32:38Z)
- Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages [8.858671209228536]
We investigate lexical sharing in multilingual machine translation from Hindi, Gujarati, Nepali into English.
We find that transliteration does not give pronounced improvements.
Our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences.
arXiv Detail & Related papers (2023-05-04T23:35:15Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)