Related papers: Beyond Arabic: Software for Perso-Arabic Script Manipulation

URL: http://arxiv.org/abs/2301.11406v1
Date: Thu, 26 Jan 2023 20:37:03 GMT
Title: Beyond Arabic: Software for Perso-Arabic Script Manipulation
Authors: Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat
Abstract summary: We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The library also provides simple FST-based romanization and transliteration.
Score: 67.31374614549237
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for eleven contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora rather than Arabic itself, this approach could be adopted for any language that uses the Arabic script, thus providing a unified framework for treating a script family used by close to a billion people.
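The two kinds of operation the abstract distinguishes, appearance-preserving normalization and region-specific rewriting that changes how characters look, can be illustrated with a minimal Python sketch. It uses only the standard `unicodedata` module, not the paper's FST library. The `PERSIAN_STYLE` table and the `to_persian_style` helper are hypothetical names introduced here for illustration; the two substitutions are a commonly used Persian convention assumed for this example, not rules taken from the library.

```python
import unicodedata

# Two encodings of the same visible letter (alef with madda above).
PRECOMPOSED = "\u0622"        # ARABIC LETTER ALEF WITH MADDA ABOVE
DECOMPOSED = "\u0627\u0653"   # ARABIC LETTER ALEF + ARABIC MADDAH ABOVE


def nfc(text: str) -> str:
    """Standard Unicode NFC: the rendered text is unchanged, the encoding is unified."""
    return unicodedata.normalize("NFC", text)


# Hypothetical region-specific table (an assumption for illustration only):
# Arabic kaf -> keheh, Arabic yeh -> Farsi yeh.
# Unlike NFC, these substitutions can change the visual appearance.
PERSIAN_STYLE = str.maketrans({"\u0643": "\u06a9", "\u064a": "\u06cc"})


def to_persian_style(text: str) -> str:
    """Apply NFC first, then the illustrative character substitutions."""
    return nfc(text).translate(PERSIAN_STYLE)


if __name__ == "__main__":
    print(nfc(DECOMPOSED) == PRECOMPOSED)
    print([f"U+{ord(c):04X}" for c in to_persian_style("\u0643\u064a")])
```

A plain character table like this ignores context, which is one reason a finite-state formulation such as the one the paper describes is better suited to real orthographic rules.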

Related papers

Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration [70.84108518476744]
We show that script is linearly encoded in the activation space of multilingual speech models, and that modifying activations at inference time enables direct control over output script. We apply this approach to inducing post-hoc control over the script of speech recognition output, where we observe competitive performance across all model sizes of Whisper.
arXiv Detail & Related papers (2026-01-06T10:45:04Z)
The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages [30.39307182175106]
We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language. Our results highlight the value of script-aware specialization for languages using the Arabic script and support further work on pre-training strategies rooted in script and language specificity.
arXiv Detail & Related papers (2025-07-24T19:28:33Z)
Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts. We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both. Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text. We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules. COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
Unicode Normalization and Grapheme Parsing of Indic Languages [2.974799610163104]
Writing systems of Indic languages have orthographic syllables, also known as complex graphemes, as unique horizontal units. Our proposed normalizer is a more efficient and effective tool than the previously used Indic normalizer. We report the pipeline for the scripts of 7 languages in this work and develop the framework for the integration of more scripts.
arXiv Detail & Related papers (2023-05-11T14:34:08Z)
PALI: A Language Identification Benchmark for Perso-Arabic Scripts [30.99179028187252]
This paper sheds light on the challenges of detecting languages using Perso-Arabic scripts. We use a set of supervised techniques to classify sentences into their languages. We also propose a hierarchical model that targets clusters of languages that are more often confused.
arXiv Detail & Related papers (2023-04-03T19:40:14Z)
New Results for the Text Recognition of Arabic Maghribī Manuscripts -- Managing an Under-resourced Script [0.0]
We introduce and assess a new modus operandi for HTR models development and fine-tuning dedicated to the Arabic Maghribī scripts. The comparison between several state-of-the-art HTR models demonstrates the relevance of a word-based neural approach specialized for Arabic. Results open new perspectives for Arabic scripts processing and more generally for poorly-endowed languages processing.
arXiv Detail & Related papers (2022-11-29T12:21:41Z)
Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages. We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues. We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z)
TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus [3.8580784887142774]
This article describes the constitution process of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and arithmographs (numbers used as letters).
arXiv Detail & Related papers (2020-03-20T22:29:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.