Related papers: Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts

Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts

URL: http://arxiv.org/abs/2407.02320v1
Date: Tue, 2 Jul 2024 14:51:20 GMT
Title: Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts
Authors: Chunlan Ma, Yihong Liu, Haotian Ye, Hinrich Schütze,
Abstract summary: We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts. We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both. Our findings show that the effectiveness of transliteration varies by task type and model size.
Score: 50.40191599304911
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Decoder-only large language models (LLMs) excel in high-resource languages across various tasks through few-shot or even zero-shot in-context learning (ICL). However, their performance often does not transfer well to low-resource languages, especially those written in non-Latin scripts. Inspired by recent work that leverages transliteration in encoder-only models, we investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts. To this end, we propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both. We apply these methods to several representative LLMs of different sizes on various tasks including text classification and sequential labeling. Our findings show that the effectiveness of transliteration varies by task type and model size. For instance, all models benefit from transliterations for sequential labeling (with increases of up to 25%).

Related papers

State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting? [0.5281705920363684]
Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks.<n>With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has increasingly moved toward zero-shot and few-shot prompting.<n>We evaluate the performance of current language models on text classification tasks across several South Slavic languages.
arXiv Detail & Related papers (2025-11-11T08:54:15Z)
Prompting with Phonemes: Enhancing LLMs' Multilinguality for Non-Latin Script Languages [37.61699757912346]
We propose leveraging phonemic transcriptions as complementary signals to induce script-invariant representations. We show that phonemic and orthographic scripts retrieve distinct examples for in-context learning (ICL) This motivates our proposed Mixed-ICL retrieval strategy, where further aggregation from both leads to significant performance improvements.
arXiv Detail & Related papers (2024-11-04T18:59:51Z)
Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization ( CLS) aims to generate a summary for the source text in a different target language. Currently, instruction-tuned large language models (LLMs) excel at various English tasks. Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z)
Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks. We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset. To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
Does Transliteration Help Multilingual Language Modeling? [0.0]
We empirically measure the effect of transliteration on Multilingual Language Models. We focus on the Indic languages, which have the highest script diversity in the world. We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages.
arXiv Detail & Related papers (2022-01-29T05:48:42Z)
UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages. We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning. During inference, the model makes predictions based on the text input in the target language and its translation in the source language. To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.