One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models
- URL: http://arxiv.org/abs/2601.05776v1
- Date: Fri, 09 Jan 2026 13:00:46 GMT
- Title: One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models
- Authors: Benedikt Ebing, Lennart Keller, Goran Glavaš,
- Abstract summary: romanization has emerged as an effective strategy for improving cross-lingual transfer (XLT) in multilingual language models (mLMs)<n>We investigate two potential sources of degradation: (i) loss of script-specific information and (ii) negative cross-lingual interference from increased vocabulary overlap.<n>Using two romanizers with different fidelity profiles, we observe negligible performance loss for languages with segmental scripts.
- Score: 6.607456010103339
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exposing latent lexical overlap, script romanization has emerged as an effective strategy for improving cross-lingual transfer (XLT) in multilingual language models (mLMs). Most prior work, however, focused on setups that favor romanization the most: (1) transfer from high-resource Latin-script to low-resource non-Latin-script languages and/or (2) between genealogically closely related languages with different scripts. It thus remains unclear whether romanization is a good representation choice for pretraining general-purpose mLMs, or, more precisely, if information loss associated with romanization harms performance for high-resource languages. We address this gap by pretraining encoder LMs from scratch on both romanized and original texts for six typologically diverse high-resource languages, investigating two potential sources of degradation: (i) loss of script-specific information and (ii) negative cross-lingual interference from increased vocabulary overlap. Using two romanizers with different fidelity profiles, we observe negligible performance loss for languages with segmental scripts, whereas languages with morphosyllabic scripts (Chinese and Japanese) suffer degradation that higher-fidelity romanization mitigates but cannot fully recover. Importantly, comparing monolingual LMs with their mLM counterpart, we find no evidence that increased subword overlap induces negative interference. We further show that romanization improves encoding efficiency (i.e., fertility) for segmental scripts at a negligible performance cost.
Related papers
- RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs [18.27925188037189]
Large Language Models (LLMs) exhibit strong multilingual performance despite being predominantly trained on English-centric corpora.<n>This raises a fundamental question: How do LLMs achieve such multilingual capabilities?<n> Focusing on languages written in non-Roman scripts, we investigate the role of Romanization as a potential bridge in multilingual processing.
arXiv Detail & Related papers (2025-02-11T10:10:26Z) - Prompting with Phonemes: Enhancing LLMs' Multilinguality for Non-Latin Script Languages [37.61699757912346]
We propose leveraging phonemic transcriptions as complementary signals to induce script-invariant representations.<n>We show that phonemic and orthographic scripts retrieve distinct examples for in-context learning (ICL)<n>This motivates our proposed Mixed-ICL retrieval strategy, where further aggregation from both leads to significant performance improvements.
arXiv Detail & Related papers (2024-11-04T18:59:51Z) - Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z) - Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z) - MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z) - RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization [17.46921734622369]
Romanized text reduces token fertility by 2x-4x.
Romanized text matches or outperforms native script representation across various NLU, NLG, and MT tasks.
arXiv Detail & Related papers (2024-01-25T16:11:41Z) - TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models [50.40191599304911]
We propose TransliCo to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script.
We show that Furina outperforms the original Glot500-m on various zero-shot crosslingual transfer tasks.
arXiv Detail & Related papers (2024-01-12T15:12:48Z) - Romanization-based Large-scale Adaptation of Multilingual Language
Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z) - Does Transliteration Help Multilingual Language Modeling? [0.0]
We empirically measure the effect of transliteration on Multilingual Language Models.
We focus on the Indic languages, which have the highest script diversity in the world.
We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages.
arXiv Detail & Related papers (2022-01-29T05:48:42Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.