Cross-Lingual Transfer from Related Languages: Treating Low-Resource
Maltese as Multilingual Code-Switching
- URL: http://arxiv.org/abs/2401.16895v2
- Date: Sat, 3 Feb 2024 07:26:47 GMT
- Title: Cross-Lingual Transfer from Related Languages: Treating Low-Resource
Maltese as Multilingual Code-Switching
- Authors: Kurt Micallef, Nizar Habash, Claudia Borg, Fadhl Eryani, Houda Bouamor
- Abstract summary: We focus on Maltese, a Semitic language with substantial influences from Arabic, Italian, and English, notably written in Latin script.
We present a novel dataset annotated with word-level etymology.
We show that conditional transliteration based on word etymology yields the best results, surpassing fine-tuning with raw Maltese or Maltese processed with non-selective pipelines.
- Score: 9.435669487585917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although multilingual language models exhibit impressive cross-lingual
transfer capabilities on unseen languages, the performance on downstream tasks
is impacted when there is a script disparity with the languages used in the
multilingual model's pre-training data. Using transliteration offers a
straightforward yet effective means to align the script of a resource-rich
language with a target language, thereby enhancing cross-lingual transfer
capabilities. However, for mixed languages, this approach is suboptimal, since
only a subset of the language benefits from the cross-lingual transfer while
the remainder is impeded. In this work, we focus on Maltese, a Semitic
language with substantial influences from Arabic, Italian, and English,
notably written in Latin script. We present a novel dataset annotated with
word-level etymology. We use this dataset to train a classifier that enables us
to make informed decisions regarding the appropriate processing of each token
in the Maltese language. We contrast indiscriminate transliteration or
translation with mixed processing pipelines that transliterate only words of
Arabic origin, thereby resulting in text with a mixture of scripts. We
fine-tune models on the processed data for four downstream tasks and show that conditional
transliteration based on word etymology yields the best results, surpassing
fine-tuning with raw Maltese or Maltese processed with non-selective pipelines.
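In code, the selective pipeline described in the abstract amounts to a per-token dispatch: classify each Maltese token by etymology, transliterate only the Arabic-origin tokens into Arabic script, and leave Romance and English material in Latin script, yielding mixed-script text. The following is a minimal toy sketch of that idea; the etymology lookup and character mapping below are invented stand-ins for the paper's trained classifier and transliteration system, not the authors' actual components.

```python
# Toy sketch of etymology-conditioned transliteration for Maltese.
# ETYMOLOGY_LOOKUP and LATIN_TO_ARABIC are illustrative stand-ins (assumptions),
# not the trained classifier or transliteration model used in the paper.

# Toy word-level etymology "classifier": maps a token to its likely origin.
ETYMOLOGY_LOOKUP = {
    "kiteb": "arabic",    # 'he wrote' (Semitic stock)
    "ktieb": "arabic",    # 'book' (Semitic stock)
    "skola": "italian",   # 'school' (Romance loan)
    "futbol": "english",  # English loan
}

# Naive Latin-to-Arabic character mapping; a real system would use a
# context-aware transliteration model rather than a character table.
LATIN_TO_ARABIC = {"k": "ك", "t": "ت", "b": "ب", "i": "ي", "e": "ي"}


def classify_etymology(token: str) -> str:
    """Return the assumed etymological origin of a token ('other' if unknown)."""
    return ETYMOLOGY_LOOKUP.get(token.lower(), "other")


def transliterate_to_arabic(token: str) -> str:
    """Character-level Latin-to-Arabic transliteration (toy)."""
    return "".join(LATIN_TO_ARABIC.get(ch, ch) for ch in token.lower())


def conditional_transliterate(sentence: str) -> str:
    """Transliterate only Arabic-origin tokens, producing mixed-script text."""
    out = []
    for token in sentence.split():
        if classify_etymology(token) == "arabic":
            out.append(transliterate_to_arabic(token))
        else:
            out.append(token)  # unknown, Romance, and English tokens stay Latin
    return " ".join(out)


if __name__ == "__main__":
    # Expected: the two Semitic-origin tokens come out in Arabic script,
    # while the Romance and English loans stay in Latin script.
    print(conditional_transliterate("kiteb ktieb skola futbol"))
```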
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363] (arXiv, 2024-06-28)
  The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
  Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
- Unknown Script: Impact of Script on Cross-Lingual Transfer [2.5398014196797605] (arXiv, 2024-04-29)
  Cross-lingual transfer has become an effective way of transferring knowledge between languages.
  We consider a case where the target language and its script are not part of the pre-trained model.
  Our findings reveal the importance of the tokenizer as a stronger factor than the shared script, language similarity, and model size.
- Zero-shot Cross-lingual Transfer without Parallel Corpus [6.937772043639308] (arXiv, 2023-10-07)
  We propose a novel approach to conduct zero-shot cross-lingual transfer with a pre-trained model.
  It consists of a Bilingual Task Fitting module that applies task-related bilingual information alignment.
  A self-training module generates pseudo soft and hard labels for unlabeled data and utilizes them to conduct self-training.
- Self-Augmentation Improves Zero-Shot Cross-Lingual Transfer [92.80671770992572] (arXiv, 2023-09-19)
  Cross-lingual transfer is a central task in multilingual NLP.
  Earlier efforts on this task use parallel corpora, bilingual dictionaries, or other annotated alignment data.
  We propose a simple yet effective method, SALT, to improve zero-shot cross-lingual transfer.
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463] (arXiv, 2023-06-13)
  We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
  Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
- Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages [8.858671209228536] (arXiv, 2023-05-04)
  We investigate lexical sharing in multilingual machine translation from Hindi, Gujarati, and Nepali into English.
  We find that transliteration does not give pronounced improvements.
  Our analysis suggests that multilingual MT models trained on original scripts are already robust to cross-script differences.
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515] (arXiv, 2023-04-18)
  Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
  We study and compare a plethora of data- and parameter-efficient strategies for adapting mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
  Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923] (arXiv, 2022-04-30)
  We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
  We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
  We report substantial gains on standard benchmarks.
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597] (arXiv, 2020-09-10)
  We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
  During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
  We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language (a minimal generic sketch of such a loss follows this list).
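The KL-divergence self-teaching idea mentioned in the FILTER entry can be illustrated generically: a teacher pass produces soft pseudo-labels for translated target-language text, and the student is trained to match them under a KL loss. The PyTorch sketch below is a generic illustration under that reading, not FILTER's actual implementation; `student_logits` and `teacher_logits` are hypothetical tensors standing in for model outputs.

```python
# Generic sketch of a KL-divergence self-teaching loss on soft pseudo-labels.
# This illustrates the general technique only; student_logits and
# teacher_logits are hypothetical stand-ins for model outputs.
import torch
import torch.nn.functional as F


def self_teaching_kl_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between soft pseudo-labels and student predictions."""
    # Soft pseudo-labels from the teacher pass; no gradient flows through them.
    soft_targets = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    # Student log-probabilities; F.kl_div expects log-probabilities as input.
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")


# Example usage: a batch of 4 examples over 3 classes.
student_logits = torch.randn(4, 3, requires_grad=True)
teacher_logits = torch.randn(4, 3)
loss = self_teaching_kl_loss(student_logits, teacher_logits)
loss.backward()
```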
This list is automatically generated from the titles and abstracts of the papers on this site.