Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization
- URL: http://arxiv.org/abs/2406.05760v1
- Date: Sun, 9 Jun 2024 12:29:55 GMT
- Title: Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization
- Authors: Salman Elgamal, Ossama Obeid, Tameem Kabbani, Go Inoue, Nizar Habash,
- Abstract summary: The absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP)
This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild"
We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context.
- Score: 9.191117990275385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP). This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild," to unveil patterns and latent information across six diverse genres: news articles, novels, children's books, poetry, political documents, and ChatGPT outputs. We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context. Additionally, we propose extensions to the analyze-and-disambiguate approach in Arabic NLP to leverage these diacritics, resulting in notable improvements. Our contributions encompass a thorough analysis, valuable datasets, and an extended diacritization algorithm. We release our code and datasets as open source.
Related papers
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting
an Under-Resourced Language [0.0]
NArabizi is a Romanized form of North African Arabic used mostly on social media.
We introduce an enriched version of NArabizi Treebank with three main contributions.
arXiv Detail & Related papers (2023-06-26T17:27:31Z) - Take the Hint: Improving Arabic Diacritization with
Partially-Diacritized Text [4.863310073296471]
We propose 2SDiac, a multi-source model that can effectively support optional diacritics in input to inform all predictions.
We also introduce Guided Learning, a training scheme to leverage given diacritics in input with different levels of random masking.
arXiv Detail & Related papers (2023-06-06T10:18:17Z) - Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z) - Towards Responsible Natural Language Annotation for the Varieties of
Arabic [12.526184907781731]
We present a playbook for responsible dataset creation for polyglossic, multidialectal languages.
This work is informed by a study on Arabic annotation of social media content.
arXiv Detail & Related papers (2022-03-17T20:23:27Z) - Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z) - A Multitask Learning Approach for Diacritic Restoration [21.288912928687186]
In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings.
Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word.
We use Arabic as a case study since it has sufficient data resources for tasks that we consider in our joint modeling.
arXiv Detail & Related papers (2020-06-07T01:20:40Z) - Improving Yor\`ub\'a Diacritic Restoration [3.301896537513352]
Yorub'a is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics.
Diacritic marks are commonly excluded from electronic texts due to limited device and application support as well as general education on proper usage.
All pre-trained models, datasets and source-code have been released as an open-source project to advance efforts on Yorub'a language technology.
arXiv Detail & Related papers (2020-03-23T22:07:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.