Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment
- URL: http://arxiv.org/abs/2406.19759v2
- Date: Wed, 09 Oct 2024 12:20:33 GMT
- Title: Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment
- Authors: Orgest Xhelili, Yihong Liu, Hinrich Schütze,
- Abstract summary: The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
- Score: 50.27950279695363
- License:
- Abstract: Multilingual pre-trained models (mPLMs) have shown impressive performance on cross-lingual transfer tasks. However, the transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language, even though the two languages may be related or share parts of their vocabularies. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method aiming to improve the cross-lingual alignment between languages using diverse scripts. We select two areal language groups, $\textbf{Mediterranean-Amharic-Farsi}$ and $\textbf{South+East Asian Languages}$, wherein the languages are mutually influenced but use different scripts. We apply our method to these language groups and conduct extensive experiments on a spectrum of downstream tasks. The results show that after PPA, models consistently outperform the original model (up to 50% for some tasks) in English-centric transfer. In addition, when we use languages other than English as sources in transfer, our method obtains even larger improvements. We will make our code and models publicly available at \url{https://github.com/cisnlp/Transliteration-PPA}.
Related papers
- An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models [26.72394783468532]
We propose an textitefficient method to study transfer language influence in zero-shot performance on another target language.
Our findings suggest that some languages do not largely affect others while some languages, especially ones unseen during pre-training, can be extremely beneficial or detrimental for different target languages.
arXiv Detail & Related papers (2024-03-29T09:52:18Z) - Cross-Lingual Transfer from Related Languages: Treating Low-Resource
Maltese as Multilingual Code-Switching [9.435669487585917]
We focus on Maltese, a Semitic language, with substantial influences from Arabic, Italian, and English, and notably written in Latin script.
We present a novel dataset annotated with word-level etymology.
We show that conditional transliteration based on word etymology yields the best results, surpassing fine-tuning with raw Maltese or Maltese processed with non-selective pipelines.
arXiv Detail & Related papers (2024-01-30T11:04:36Z) - TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models [50.40191599304911]
We propose TransliCo to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script.
We show that Furina outperforms the original Glot500-m on various zero-shot crosslingual transfer tasks.
arXiv Detail & Related papers (2024-01-12T15:12:48Z) - Self-Augmentation Improves Zero-Shot Cross-Lingual Transfer [92.80671770992572]
Cross-lingual transfer is a central task in multilingual NLP.
Earlier efforts on this task use parallel corpora, bilingual dictionaries, or other annotated alignment data.
We propose a simple yet effective method, SALT, to improve the zero-shot cross-lingual transfer.
arXiv Detail & Related papers (2023-09-19T19:30:56Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Investigating Lexical Sharing in Multilingual Machine Translation for
Indian Languages [8.858671209228536]
We investigate lexical sharing in multilingual machine translation from Hindi, Gujarati, Nepali into English.
We find that transliteration does not give pronounced improvements.
Our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences.
arXiv Detail & Related papers (2023-05-04T23:35:15Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.