Improving Informally Romanized Language Identification
- URL: http://arxiv.org/abs/2504.21540v1
- Date: Wed, 30 Apr 2025 11:36:28 GMT
- Title: Improving Informally Romanized Language Identification
- Authors: Adrian Benton, Alexander Gutkin, Christo Kirov, Brian Roark
- Abstract summary: Romanization renders languages that are normally easily distinguished based on script highly confusable, such as Hindi and Urdu. We increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set.
- Score: 49.404145019682666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Latin script is often used to informally write languages with non-Latin native scripts. In many cases (e.g., most languages in India), there is no conventional spelling of words in the Latin script, hence there will be high spelling variability in written text. Such romanization renders languages that are normally easily distinguished based on script highly confusable, such as Hindi and Urdu. In this work, we increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We find that training on synthetic samples which incorporate natural spelling variation yields higher LID system accuracy than including available naturally occurring examples in the training set, or even training higher capacity models. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set (Madhani et al., 2023a), improving test F1 from the reported 74.7% (using a pretrained neural model) to 85.4% using a linear classifier trained solely on synthetic data and 88.2% when also training on available harvested text.
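The headline result is that a linear classifier trained on well-synthesized romanized data beats a pretrained neural model. A minimal sketch of that style of system, using scikit-learn character n-grams (the toy training samples and feature settings below are illustrative assumptions, not the authors' pipeline):

```python
# Sketch: linear LID over character n-grams, in the spirit of the paper's
# linear classifier. The texts below are toy stand-ins for synthetic
# romanizations; the paper generates these from native-script data with
# models of natural spelling variation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "main kal dilli ja raha hoon",   # Hindi, one romanization
    "mai kal dillee jaa rahaa hu",   # Hindi, spelling variant
    "main kal lahore ja raha hun",   # Urdu, one romanization
    "mein kal lahor jaa rha hoon",   # Urdu, spelling variant
]
train_labels = ["hi", "hi", "ur", "ur"]

# Character n-grams are robust to the spelling variance that makes
# romanized Hindi and Urdu so confusable at the word level.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["kal dilli jaana hai"]))
```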
Related papers
- Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages [0.0]
In multilingual societies like India, text often exhibits code-mixing, blending local languages with English at different linguistic levels. This paper introduces a prompt-based method for a shared task aimed at addressing word-level LID challenges in Dravidian languages. In this work, we leveraged GPT-3.5 Turbo to test whether a large language model can classify words into the correct categories.
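A rough sketch of what word-level prompting with GPT-3.5 Turbo can look like, using the OpenAI Python client (the prompt wording and label set are assumptions, not the shared-task specification):

```python
# Sketch of prompt-based word-level LID with an OpenAI chat model.
# Labels and prompt wording are illustrative, not the shared-task spec.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def classify_word(word: str, sentence: str) -> str:
    prompt = (
        "Classify the language of one word in a code-mixed sentence.\n"
        f"Sentence: {sentence}\n"
        f"Word: {word}\n"
        "Answer with exactly one label: Tamil, English, or Mixed."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```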
arXiv Detail & Related papers (2024-11-06T16:20:37Z)
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
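A small sketch of the three template variants (the wording and the romanizer hook are assumptions; the paper defines its own templates):

```python
# Sketch of the three prompt variants: native script, Latin, or both.
# `romanize` is a hypothetical stand-in for any transliterator.
def build_prompts(text_native: str, romanize) -> dict:
    text_latin = romanize(text_native)
    return {
        "native": f"Text: {text_native}\nLabel:",
        "latin":  f"Text: {text_latin}\nLabel:",
        "both":   f"Text: {text_native}\nRomanized: {text_latin}\nLabel:",
    }
```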
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
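Not MYTE itself, but a quick illustration of the disparity it targets: the same short sentence costs very different numbers of UTF-8 bytes across scripts.

```python
# Not MYTE: just the UTF-8 byte-length disparity it is designed to reduce.
samples = {
    "English": "How are you?",
    "Hindi":   "आप कैसे हैं?",
    "Amharic": "እንዴት ነህ?",
}
for lang, text in samples.items():
    print(f"{lang}: {len(text)} chars -> {len(text.encode('utf-8'))} bytes")
```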
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models [50.40191599304911]
We propose TransliCo to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script.
We show that the resulting model, Furina, outperforms the original Glot500-m on various zero-shot cross-lingual transfer tasks.
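A minimal sketch of the kind of contrastive objective involved, pairing each sentence embedding with the embedding of its transliteration (an InfoNCE-style loss; TransliCo's exact losses and batching are defined in the paper):

```python
# InfoNCE-style loss over (sentence, transliteration) embedding pairs.
# A sketch of the general technique, not TransliCo's exact objective.
import torch
import torch.nn.functional as F

def contrastive_loss(orig_emb: torch.Tensor,
                     translit_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    orig = F.normalize(orig_emb, dim=-1)        # (batch, dim)
    trans = F.normalize(translit_emb, dim=-1)   # (batch, dim)
    logits = orig @ trans.T / temperature       # all-pairs similarity
    targets = torch.arange(orig.size(0))        # true pairs on the diagonal
    return F.cross_entropy(logits, targets)
```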
arXiv Detail & Related papers (2024-01-12T15:12:48Z)
- Machine Translation by Projecting Text into the Same Phonetic-Orthographic Space Using a Common Encoding [3.0422770070015295]
We propose an approach based on common multilingual Latin-based encodings (WX notation) that take advantage of language similarity.
We verify the proposed approach by demonstrating experiments on similar language pairs.
We also get up to 1 BLEU point of improvement on distant and zero-shot language pairs.
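A toy fragment of the idea: project native scripts into one shared Latin encoding so that related languages line up character-for-character (the mapping below covers only a few Devanagari letters and is illustrative, not the full WX table):

```python
# Toy fragment of a Devanagari-to-WX-style mapping (illustrative only;
# the real WX tables cover the full ISCII character inventory).
WX_FRAGMENT = {
    "अ": "a", "आ": "A", "इ": "i", "ई": "I",
    "क": "k", "ख": "K", "ग": "g", "घ": "G",
}

def to_common_encoding(text: str) -> str:
    return "".join(WX_FRAGMENT.get(ch, ch) for ch in text)

print(to_common_encoding("कई"))  # -> kI
```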
arXiv Detail & Related papers (2023-05-21T06:46:33Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
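A sketch of producing UROMAN romanizations for adaptation data by piping text through the uroman command-line tool (this assumes uroman.pl is installed and on PATH; it is not the paper's training code):

```python
# Sketch: romanize text by piping it through the uroman CLI
# (assumes uroman.pl is installed and on PATH).
import subprocess

def uromanize(text: str) -> str:
    result = subprocess.run(
        ["uroman.pl"],
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(uromanize("नमस्ते दुनिया"))  # roughly "namaste duniya"
```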
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
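One concrete flavor of the problem is visually identical letters encoded as different codepoints across Arabic, Persian, and Urdu conventions. A small sketch of collapsing such variants (the canonical targets chosen here are assumptions; real normalization tables are language-specific):

```python
# Sketch: collapse confusable Perso-Arabic codepoint variants.
# Canonical targets are illustrative; real tables are language-specific.
NORMALIZE = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> FARSI YEH
    "\u0649": "\u06CC",  # ALEF MAKSURA -> FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> KEHEH
})

def normalize(text: str) -> str:
    return text.translate(NORMALIZE)
```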
arXiv Detail & Related papers (2022-10-21T21:59:44Z)
- Towards Boosting the Accuracy of Non-Latin Scene Text Recognition [27.609596088151644]
Scene-text recognition is remarkably better for Latin-script languages than for non-Latin-script languages.
This paper examines the possible reasons for this low accuracy by comparing English datasets with datasets for non-Latin-script languages.
arXiv Detail & Related papers (2022-01-10T06:36:43Z)
- Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset [9.478817207385472]
This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages.
The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet.
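A sketch of loading one of the romanization lexicons (the file path and three-column layout are assumptions based on the TSV release; check the dataset README for the exact schema):

```python
# Sketch: load a Dakshina romanization lexicon (tab-separated).
# Path and column order are assumptions; see the dataset README.
import csv
from collections import defaultdict

def load_lexicon(path: str) -> dict:
    """Map each native-script word to its attested romanizations."""
    lex = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for native, roman, count in csv.reader(f, delimiter="\t"):
            lex[native].append((roman, int(count)))
    return lex

lexicon = load_lexicon(
    "dakshina_dataset_v1.0/hi/lexicons/hi.translit.sampled.train.tsv")
```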
arXiv Detail & Related papers (2020-07-02T14:57:28Z)