Related papers: Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations

Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations

URL: http://arxiv.org/abs/2212.06468v1
Date: Tue, 13 Dec 2022 10:37:10 GMT
Title: Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations
Authors: Mustafa Jarrar and Fadi A Zaraket and Tymaa Hammouda and Daanish Masood Alavi and Martin Waahlisch
Abstract summary: This article presents morphologically annotated Yemeni, Sudanese, Iraqi, and Libyan Arabic Lisan corpora. We collected the content of the corpora from several social media platforms. The annotators segmented all words in the four corpora into prefixes, stems, and suffixes and labeled each with different morphological features such as part of speech, lemma, and a gloss in English.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: This article presents morphologically annotated Yemeni, Sudanese, Iraqi, and Libyan Arabic dialects Lisan corpora. Lisan features around 1.2 million tokens. We collected the content of the corpora from several social media platforms. The Yemeni corpus (~1.05M tokens) was collected automatically from Twitter. The corpora of the other three dialects (~50K tokens each) were collected manually from Facebook and YouTube posts and comments. Thirty-five (35) annotators who are native speakers of the target dialects carried out the annotations. The annotators segmented all words in the four corpora into prefixes, stems, and suffixes and labeled each with different morphological features such as part of speech, lemma, and a gloss in English. An Arabic Dialect Annotation Toolkit (ADAT) was developed for the purpose of the annotation. The annotators were trained on a set of guidelines and on how to use ADAT. We developed ADAT to assist the annotators and to ensure compatibility with SAMA and Curras tagsets. The tool is open source, and the four corpora are also available online.
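
Illustration: the abstract describes each word as segmented into prefixes, a stem, and suffixes, with a part of speech, a lemma, and an English gloss attached. A minimal Python sketch of what such a record could look like follows. The class names, field names, tag labels, and the transliterated example word are all assumptions made for illustration; they are not the actual Lisan file format, the ADAT data model, or real SAMA/Curras tags.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    surface: str  # the morpheme's text (transliterated here for readability)
    pos: str      # illustrative part-of-speech label, NOT a real SAMA/Curras tag


@dataclass
class AnnotatedToken:
    token: str                 # the full word as it appears in the text
    stem: Morpheme             # every token has exactly one stem in this sketch
    lemma: str                 # dictionary form of the stem
    gloss: str                 # short English gloss
    prefixes: List[Morpheme] = field(default_factory=list)
    suffixes: List[Morpheme] = field(default_factory=list)

    def segments(self) -> List[str]:
        """Return the surface forms in order: prefixes, stem, suffixes."""
        parts = [m.surface for m in self.prefixes]
        parts.append(self.stem.surface)
        parts.extend(m.surface for m in self.suffixes)
        return parts


# Hypothetical example (not taken from the corpora): "wktAbhm",
# a transliteration of a word meaning roughly "and their book".
example = AnnotatedToken(
    token="wktAbhm",
    prefixes=[Morpheme("w", "CONJ")],
    stem=Morpheme("ktAb", "NOUN"),
    suffixes=[Morpheme("hm", "POSS_PRON")],
    lemma="kitAb",
    gloss="book",
)

print("+".join(example.segments()))
```

In this sketch the segments concatenate back to the original token, which is a convenient sanity check for a segmentation record; whether the real corpora guarantee that property is not stated in the abstract.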

Related papers

QuranMorph: Morphologically Annotated Quranic Corpus [0.0]
QuranMorph is a morphologically annotated corpus for the Quran. The lemmatization process utilized lemmas from Qabas, an Arabic lexicographic database. The part-of-speech tagging was performed using the fine-grained SAMA/Qabas tagset.
arXiv Detail & Related papers (2025-06-22T19:34:09Z)
Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLM) in the Arab world. One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding. Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus [8.96693684560691]
ZAEBUC-Spoken is a multilingual multidialectal Arabic-English speech corpus. The corpus presents a challenging set for automatic speech recognition (ASR). We take inspiration from established sets of transcription guidelines to present a set of guidelines handling issues of conversational speech, code-switching and orthography of both languages.
arXiv Detail & Related papers (2024-03-27T01:19:23Z)
BiMediX: Bilingual Medical Mixture of Experts LLM [94.85518237963535]
We introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic. Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details. We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations.
arXiv Detail & Related papers (2024-02-20T18:59:26Z)
ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi). We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z)
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages. We developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
Maknuune: A Large Open Palestinian Arabic Lexicon [8.230763074145706]
Maknuune has over 36K entries from 17K lemmas, and 3.7K roots. Maknuune is a large open lexicon for the Palestinian Arabic dialect.
arXiv Detail & Related papers (2022-10-24T07:19:03Z)
Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect. Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect. We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
arXiv Detail & Related papers (2022-04-03T01:34:48Z)
Comprehensive Benchmark Datasets for Amharic Scene Text Detection and Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa. The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals. We presented the first comprehensive public datasets named HUST-ART, HUST-AST, ABE, and Tana for Amharic script detection and recognition in the natural scene.
arXiv Detail & Related papers (2022-03-23T03:19:35Z)
QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus [11.113497373432411]
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from Aljazeera news channel.
arXiv Detail & Related papers (2021-06-24T13:20:40Z)
Jira: a Kurdish Speech Recognition System Designing and Building Speech Corpus and Pronunciation Lexicon [4.226093500082746]
We introduce the first large vocabulary speech recognition system (LVSR) for the Central Kurdish language, named Jira. The Kurdish language is an Indo-European language spoken by more than 30 million people in several countries. Regarding speech corpus, we designed a sentence collection in which the ratio of di-phones in the collection resembles the real data of the Central Kurdish language. A test set including 11 different document topics is designed and recorded in two corresponding speech conditions.
arXiv Detail & Related papers (2021-02-15T09:27:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.