Curras + Baladi: Towards a Levantine Corpus
- URL: http://arxiv.org/abs/2205.09692v1
- Date: Thu, 19 May 2022 16:53:04 GMT
- Title: Curras + Baladi: Towards a Levantine Corpus
- Authors: Karim El Haff, Mustafa Jarrar, Tymaa Hammouda, Fadi Zaraket
- Abstract summary: We present the Lebanese Corpus Baladi that consists of around 9.6K morphologically annotated tokens.
Our proposed corpus was constructed to be used to enrich Curras and transform it into a more general Levantine corpus.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The processing of the Arabic language is a complex field of research. This is
due to many factors, including the complex and rich morphology of Arabic, its
high degree of ambiguity, and the presence of several regional varieties that
need to be processed while taking into account their unique characteristics.
When its dialects are taken into account, this language pushes the limits of
NLP to find solutions to problems posed by its inherent nature. It is a
diglossic language; the standard language is used in formal settings and in
education and is quite different from the vernacular languages spoken in the
different regions and influenced by older languages that were historically
spoken in those regions. This should encourage NLP specialists to create
dialect-specific corpora such as the Palestinian morphologically annotated
Curras corpus of Birzeit University. In this work, we present the Lebanese
Corpus Baladi that consists of around 9.6K morphologically annotated tokens.
Since Lebanese and Palestinian dialects are part of the same Levantine
dialectal continuum, and thus highly mutually intelligible, our proposed corpus
was constructed to be used to (1) enrich Curras and transform it into a more
general Levantine corpus and (2) improve Curras by solving detected errors.
Related papers
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus [8.96693684560691]
ZAEBUC-Spoken is a multilingual multidialectal Arabic-English speech corpus.
The corpus presents a challenging set for automatic speech recognition (ASR).
We take inspiration from established sets of transcription guidelines to present a set of guidelines handling issues of conversational speech, code-switching and orthography of both languages.
arXiv Detail & Related papers (2024-03-27T01:19:23Z)
- What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi).
We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z)
- DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules [64.93179829965072]
DADA is a modular approach to imbue models trained on Standard American English (SAE) with multi-dialectal robustness.
We show that DADA is effective for both single-task and instruction fine-tuned language models.
arXiv Detail & Related papers (2023-05-22T18:43:31Z)
- Post-hoc analysis of Arabic transformer models [20.741730718486032]
We probe how linguistic information is encoded in the transformer models, trained on different Arabic dialects.
We perform a layer and neuron analysis on the models using morphological tagging tasks for different dialects of Arabic and a dialectal identification task.
arXiv Detail & Related papers (2022-10-18T16:53:51Z)
- VALUE: Understanding Dialect Disparity in NLU [50.35526025326337]
We construct rules for 11 features of African American Vernacular English (AAVE).
We recruit fluent AAVE speakers to validate each feature transformation via linguistic acceptability judgments.
Experiments show that these new dialectal features can lead to a drop in model performance.
arXiv Detail & Related papers (2022-04-06T18:30:56Z)
- Interpreting Arabic Transformer Models [18.98681439078424]
We probe how linguistic information is encoded in Arabic pretrained models, trained on different varieties of Arabic language.
We perform a layer and neuron analysis on the models using three intrinsic tasks: two morphological tagging tasks based on MSA (modern standard Arabic) and dialectal POS-tagging and a dialectal identification task.
arXiv Detail & Related papers (2022-01-19T06:32:25Z)
- Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi [2.76240219662896]
We study the ability of multilingual language models to process an unseen dialect.
We take user generated North-African Arabic as our case study.
We show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect.
arXiv Detail & Related papers (2020-05-01T11:29:23Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.