Beyond Arabic: Software for Perso-Arabic Script Manipulation
- URL: http://arxiv.org/abs/2301.11406v1
- Date: Thu, 26 Jan 2023 20:37:03 GMT
- Title: Beyond Arabic: Software for Perso-Arabic Script Manipulation
- Authors: Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat
- Abstract summary: We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
- Score: 67.31374614549237
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents an open-source software library that provides a set of
finite-state transducer (FST) components and corresponding utilities for
manipulating the writing systems of languages that use the Perso-Arabic script.
The operations include various levels of script normalization, including visual
invariance-preserving operations that subsume and go beyond the standard
Unicode normalization forms, as well as transformations that modify the visual
appearance of characters in accordance with the regional orthographies for
eleven contemporary languages from diverse language families. The library also
provides simple FST-based romanization and transliteration. We additionally
attempt to formalize the typology of Perso-Arabic characters by providing
one-to-many mappings from Unicode code points to the languages that use them.
While our work focuses on the Arabic script diaspora rather than Arabic itself,
this approach could be adopted for any language that uses the Arabic script,
thus providing a unified framework for treating a script family used by close
to a billion people.
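The difference between standard Unicode normalization and a visual-invariance-preserving step that goes beyond it can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses plain string replacement from the standard library rather than FSTs, it does not use or reproduce the library's API, and the `VISUAL_REWRITES` table and the `normalize_visual` function are hypothetical names introduced here. The single rewrite rule is an assumed example of a sequence that typically renders the same as a precomposed letter but that NFC leaves unchanged.

```python
import unicodedata

# Hypothetical rewrite table (an assumption for illustration, not taken from
# the library): sequences that typically render identically to the target
# character but are left untouched by NFC.
VISUAL_REWRITES = {
    # FARSI YEH (U+06CC) + HAMZA ABOVE (U+0654) -> YEH WITH HAMZA ABOVE (U+0626)
    "\u06CC\u0654": "\u0626",
}


def normalize_visual(text: str) -> str:
    """Apply NFC first, then the extra visually invariant rewrites."""
    text = unicodedata.normalize("NFC", text)
    for source, target in VISUAL_REWRITES.items():
        text = text.replace(source, target)
    return text


# NFC alone composes ALEF (U+0627) + MADDAH ABOVE (U+0653) into U+0622.
alef_madda = "\u0627\u0653"
nfc_alef_madda = unicodedata.normalize("NFC", alef_madda)

# NFC alone does not change FARSI YEH + HAMZA ABOVE; the extra rule does.
farsi_yeh_hamza = "\u06CC\u0654"
nfc_farsi_yeh_hamza = unicodedata.normalize("NFC", farsi_yeh_hamza)
visual_farsi_yeh_hamza = normalize_visual(farsi_yeh_hamza)
```

A real implementation would express such rules as context-dependent FST rewrites and would make them language-specific, since a rewrite that is safe for one orthography may not be safe for another.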
Related papers
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- Unicode Normalization and Grapheme Parsing of Indic Languages [2.974799610163104]
Writing systems of Indic languages have orthographic syllables, also known as complex graphemes, as unique horizontal units.
Our proposed normalizer is a more efficient and effective tool than the previously used Indic normalizer.
We report the pipeline for the scripts of 7 languages in this work and develop the framework for the integration of more scripts.
arXiv Detail & Related papers (2023-05-11T14:34:08Z)
- PALI: A Language Identification Benchmark for Perso-Arabic Scripts [30.99179028187252]
This paper sheds light on the challenges of detecting languages using Perso-Arabic scripts.
We use a set of supervised techniques to classify sentences into their languages.
We also propose a hierarchical model that targets clusters of languages that are more often confused.
arXiv Detail & Related papers (2023-04-03T19:40:14Z)
- New Results for the Text Recognition of Arabic Maghribī Manuscripts -- Managing an Under-resourced Script [0.0]
We introduce and assess a new modus operandi for HTR models development and fine-tuning dedicated to the Arabic Maghribī scripts.
The comparison between several state-of-the-art HTR models demonstrates the relevance of a word-based neural approach specialized for Arabic.
Results open new perspectives for Arabic scripts processing and more generally for poorly-endowed languages processing.
arXiv Detail & Related papers (2022-11-29T12:21:41Z)
- Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z)
- TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus [3.8580784887142774]
This article describes the process of building the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC).
Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and arithmographs (numbers used as letters).
arXiv Detail & Related papers (2020-03-20T22:29:42Z)