MANorm: A Normalization Dictionary for Moroccan Arabic Dialect Written
in Latin Script
- URL: http://arxiv.org/abs/2206.09167v1
- Date: Sat, 18 Jun 2022 10:17:46 GMT
- Title: MANorm: A Normalization Dictionary for Moroccan Arabic Dialect Written
in Latin Script
- Authors: Randa Zarnoufi, Walid Bachri, Hamid Jaafar and Mounia Abik
- Abstract summary: We exploit the power of word embedding models generated with a corpus of YouTube comments.
We have built a normalization dictionary that we refer to as MANorm.
- Score: 0.05833117322405446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social media user-generated text is now a primary resource for many NLP
tasks. This text, however, does not follow standard writing conventions.
Moreover, the use of dialects such as Moroccan Arabic in written communication
further increases the complexity of NLP tasks. A dialect is a spoken language
variety without a standard orthography, which leads users to improvise
spellings while writing. Thus, the same word can appear in multiple
transliterated forms, and it is necessary to normalize these variants to a
single canonical word form. To reach this goal, we exploited the power of word
embedding models trained on a corpus of YouTube comments. In addition, using a
Moroccan Arabic dialect dictionary that provides the canonical forms, we built
a normalization dictionary that we refer to as MANorm. We conducted several
experiments to demonstrate the efficiency of MANorm, which showed its
usefulness in dialect normalization.
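The approach described in the abstract can be sketched as follows: train word embeddings on a noisy social-media corpus, then link each out-of-dictionary spelling variant to its most similar canonical form. The sketch below is hypothetical and uses toy vectors and example words, not the paper's actual data or parameters; in practice the vectors would come from a model such as word2vec trained on the YouTube-comment corpus.

```python
# Hedged sketch of the MANorm idea: map noisy Latin-script variants of a
# Moroccan Arabic word to a canonical form via embedding similarity.
# All vectors and words below are toy illustrative values.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings (in practice, learned from a corpus of YouTube comments).
embeddings = {
    "bzaf":  [0.90, 0.10, 0.00],  # canonical form ("a lot")
    "bzzaf": [0.85, 0.15, 0.05],  # spelling variant
    "bezaf": [0.88, 0.12, 0.02],  # spelling variant
    "khobz": [0.10, 0.90, 0.30],  # unrelated word ("bread")
}

# Canonical forms supplied by a dialect dictionary.
canonical = ["bzaf", "khobz"]

def build_norm_dict(embeddings, canonical, threshold=0.99):
    """Link each non-canonical word to its most similar canonical form,
    keeping only pairs whose similarity clears the threshold."""
    norm = {}
    for word, vec in embeddings.items():
        if word in canonical:
            continue
        best = max(canonical, key=lambda c: cosine(vec, embeddings[c]))
        if cosine(vec, embeddings[best]) >= threshold:
            norm[word] = best
    return norm

print(build_norm_dict(embeddings, canonical))
# → {'bzzaf': 'bzaf', 'bezaf': 'bzaf'}
```

The similarity threshold is the key design choice: set too low, semantically related but distinct words (like "khobz") would be wrongly merged into the dictionary; set too high, genuine spelling variants are missed.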
Related papers
- AlcLaM: Arabic Dialectal Language Model [2.8477895544986955]
We construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms.
We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch.
Named AlcLaM, our model was trained using only 13 GB of text, which represents a fraction of the data used by existing models.
arXiv Detail & Related papers (2024-07-18T02:13:50Z)
- Exploiting Dialect Identification in Automatic Dialectal Text Normalization [9.320305816520422]
We aim to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA).
We benchmark newly developed sequence-to-sequence models on the task of CODAfication.
We show that using dialect identification information improves the performance across all dialects.
arXiv Detail & Related papers (2024-07-03T11:30:03Z)
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi).
We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
- Multi-VALUE: A Framework for Cross-Dialectal English NLP [49.55176102659081]
Multi-VALUE is a controllable rule-based translation system spanning 50 English dialects.
Stress tests reveal significant performance disparities for leading models on non-standard dialects.
We partner with native speakers of Chicano and Indian English to release new gold-standard variants of the popular CoQA task.
arXiv Detail & Related papers (2022-12-15T18:17:01Z)
- DICTDIS: Dictionary Constrained Disambiguation for Improved NMT [50.888881348723295]
We present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries.
We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering.
arXiv Detail & Related papers (2022-10-13T13:04:16Z)
- VALUE: Understanding Dialect Disparity in NLU [50.35526025326337]
We construct rules for 11 features of African American Vernacular English (AAVE).
We recruit fluent AAVE speakers to validate each feature transformation via linguistic acceptability judgments.
Experiments show that these new dialectal features can lead to a drop in model performance.
arXiv Detail & Related papers (2022-04-06T18:30:56Z)
- Offensive Language Detection in Under-resourced Algerian Dialectal Arabic Language [0.0]
We focus on the Algerian dialectal Arabic which is one of under-resourced languages.
Due to the scarcity of works on this language, we have built a new corpus comprising more than 8.7k texts manually annotated as normal, abusive, or offensive.
arXiv Detail & Related papers (2022-03-18T15:42:21Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi [2.76240219662896]
We study the ability of multilingual language models to process an unseen dialect.
We take user-generated North African Arabic as our case study.
We show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect.
arXiv Detail & Related papers (2020-05-01T11:29:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.