Graphemic Normalization of the Perso-Arabic Script
- URL: http://arxiv.org/abs/2210.12273v3
- Date: Mon, 29 Jan 2024 13:03:25 GMT
- Title: Graphemic Normalization of the Perso-Arabic Script
- Authors: Raiomond Doctor, Alexander Gutkin, Cibu Johny, Brian Roark, and Richard Sproat
- Abstract summary: This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
- Score: 47.429213930688086
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Since its original appearance in 1991, the Perso-Arabic script representation
in Unicode has grown from 169 to over 440 atomic isolated characters spread
over several code pages representing standard letters, various diacritics and
punctuation for the original Arabic and numerous other regional orthographic
traditions. This paper documents the challenges that Perso-Arabic presents
beyond the best-documented languages, such as Arabic and Persian, building on
earlier work by the expert community. We particularly focus on the situation in
natural language processing (NLP), which is affected by multiple, often
neglected, issues such as the use of visually ambiguous yet canonically
nonequivalent letters and the mixing of letters from different orthographies.
Among the contributing conflating factors are the lack of input methods, the
instability of modern orthographies, insufficient literacy, and loss or lack of
orthographic tradition. We evaluate the effects of script normalization on
eight languages from diverse language families in the Perso-Arabic script
diaspora on machine translation and statistical language modeling tasks. Our
results indicate statistically significant improvements in performance in most
conditions for all the languages considered when normalization is applied. We
argue that better understanding and representation of Perso-Arabic script
variation within regional orthographic traditions, where those are present, is
crucial for further progress of modern computational NLP techniques especially
for languages with a paucity of resources.
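The distinction the abstract draws between "visually ambiguous" and "canonically nonequivalent" letters can be illustrated with a minimal Python sketch. It relies only on the standard `unicodedata` module. The mapping table `PERSIAN_VISUAL_MAP` and the function `normalize_persian` are hypothetical names introduced here for illustration; they are not the paper's rule set, and the two letter pairs chosen are only assumed examples of characters that can look alike in some positional forms and fonts.

```python
import unicodedata

# Canonical equivalence: ALEF (U+0627) + MADDAH ABOVE (U+0653) composes
# to ALEF WITH MADDA ABOVE (U+0622) under Unicode NFC.
assert unicodedata.normalize("NFC", "\u0627\u0653") == "\u0622"

# Hypothetical, illustrative mapping for a Persian-oriented normalizer.
# These pairs are NOT canonically equivalent, so NFC leaves them distinct.
PERSIAN_VISUAL_MAP = {
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
}
_TABLE = str.maketrans(PERSIAN_VISUAL_MAP)


def normalize_persian(text: str) -> str:
    """Apply NFC, then the illustrative visual-confusable mapping."""
    return unicodedata.normalize("NFC", text).translate(_TABLE)


# The same word typed with Arabic KAF versus Persian KEHEH.
with_arabic_kaf = "\u0643\u062A\u0627\u0628"
with_keheh = "\u06A9\u062A\u0627\u0628"

# NFC alone does not unify the two spellings ...
nfc_unifies = (
    unicodedata.normalize("NFC", with_arabic_kaf)
    == unicodedata.normalize("NFC", with_keheh)
)
# ... whereas the language-specific mapping does.
mapped_unifies = normalize_persian(with_arabic_kaf) == normalize_persian(with_keheh)
```

Such a mapping is language-specific by design: replacing KAF with KEHEH would be appropriate for Persian text but wrong for Arabic, which is why a sketch like this assumes the target orthography is known in advance.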
Related papers
- HATFormer: Historic Handwritten Arabic Text Recognition with Transformers [6.3660090769559945]
Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models.
We propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model.
Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges.
(2024-10-03T03:43:29Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
(2024-06-28T08:59:24Z)
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
(2024-03-15T21:21:11Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
(2023-09-19T14:42:33Z)
- Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities [36.578851892373365]
Social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages.
This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script.
Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated.
(2023-05-25T18:18:42Z)
- PALI: A Language Identification Benchmark for Perso-Arabic Scripts [30.99179028187252]
This paper sheds light on the challenges of detecting languages using Perso-Arabic scripts.
We use a set of supervised techniques to classify sentences into their languages.
We also propose a hierarchical model that targets clusters of languages that are more often confused.
(2023-04-03T19:40:14Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
(2023-01-26T20:37:03Z)
- Huruf: An Application for Arabic Handwritten Character Recognition Using Deep Learning [0.0]
We propose a lightweight Convolutional Neural Network-based architecture for recognizing Arabic characters and digits.
The proposed pipeline consists of a total of 18 layers containing four layers each for convolution, pooling, batch normalization, dropout, and finally one Global average layer.
The proposed model respectively achieved an accuracy of 96.93% and 99.35% which is comparable to the state-of-the-art and makes it a suitable solution for real-life end-level applications.
(2022-12-16T17:39:32Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
(2020-08-20T17:58:56Z)