Qabas: An Open-Source Arabic Lexicographic Database
- URL: http://arxiv.org/abs/2406.06598v1
- Date: Thu, 6 Jun 2024 09:25:36 GMT
- Title: Qabas: An Open-Source Arabic Lexicographic Database
- Authors: Mustafa Jarrar, Tymaa Hammouda
- Abstract summary: We present Qabas, a novel open-source Arabic lexicon designed for NLP applications.
Qabas lexical entries (lemmas) are assembled by linking lemmas from 110 lexicons.
Qabas lemmas are also linked to 12 morphologically annotated corpora.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Qabas, a novel open-source Arabic lexicon designed for NLP applications. The novelty of Qabas lies in its synthesis of 110 lexicons. Specifically, Qabas lexical entries (lemmas) are assembled by linking lemmas from 110 lexicons. Furthermore, Qabas lemmas are also linked to 12 morphologically annotated corpora (about 2M tokens), making it the first Arabic lexicon to be linked to lexicons and corpora. Qabas was developed semi-automatically, utilizing a mapping framework and a web-based tool. Compared with other lexicons, Qabas stands as the most extensive Arabic lexicon, encompassing about 58K lemmas (45K nominal lemmas, 12.5K verbal lemmas, and 473 functional-word lemmas). Qabas is open-source and accessible online at https://sina.birzeit.edu/qabas.
Related papers
- QuranMorph: Morphologically Annotated Quranic Corpus [0.0]
QuranMorph is a morphologically annotated corpus for the Quran. The lemmatization process utilized lemmas from Qabas, an Arabic lexicographic database. The part-of-speech tagging was performed using the fine-grained SAMA/Qabas tagset.
arXiv Detail & Related papers (2025-06-22T19:34:09Z)
- Cross-Language Approach for Quranic QA [1.0124625066746595]
The Quranic QA system holds significant importance as it facilitates a deeper understanding of the Quran, a Holy text for over a billion people worldwide.
These systems face unique challenges, including the linguistic disparity between questions written in Modern Standard Arabic and answers found in Quranic verses written in Classical Arabic.
We adopt a cross-language approach by expanding and enriching the dataset through machine translation to convert Arabic questions into English, paraphrasing questions to create linguistic diversity, and retrieving answers from an English translation of the Quran to align with multilingual training requirements.
arXiv Detail & Related papers (2025-01-29T07:13:27Z)
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), for extractive question answering (EQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- Building Efficient and Effective OpenQA Systems for Low-Resource Languages [17.64851283209797]
We show that effective, low-cost OpenQA systems can be developed for low-resource contexts.
Key ingredients are weak supervision using machine-translated labeled datasets and a relevant unstructured knowledge source.
We present SQuAD-TR, a machine translation of SQuAD2.0, and we build our OpenQA system by adapting ColBERT-QA and retraining it over Turkish resources.
arXiv Detail & Related papers (2024-01-07T22:11:36Z)
- TCE at Qur'an QA 2022: Arabic Language Question Answering Over Holy Qur'an Using a Post-Processed Ensemble of BERT-based Models [0.0]
Arabic is the language of the Holy Qur'an, the sacred text for 1.8 billion people across the world.
We propose an ensemble learning model based on Arabic variants of BERT models.
Our system achieves a Partial Reciprocal Rank (pRR) score of 56.6% on the official test set.
arXiv Detail & Related papers (2022-06-03T13:00:48Z)
- The Arabic Ontology -- An Arabic Wordnet with Ontologically Clean Content [0.0]
The ontology consists of about 1,300 well-investigated concepts in addition to 11,000 concepts that are partially validated.
The ontology is accessible and searchable through a lexicographic search engine.
The ontology is fully mapped to Princeton WordNet, Wikidata, and other resources.
arXiv Detail & Related papers (2022-05-19T16:27:44Z)
- DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning [66.71308154398176]
Spoken Question Answering (SQA) has gained research attention and made remarkable progress in recent years.
Existing SQA methods rely on Automatic Speech Recognition (ASR) transcripts, which are time- and cost-prohibitive to collect.
This work proposes an ASR transcript-free SQA framework named Discrete Unit Adaptive Learning (DUAL), which leverages unlabeled data for pre-training and is fine-tuned by the SQA downstream task.
arXiv Detail & Related papers (2022-03-09T17:46:22Z)
- QALD-9-plus: A Multilingual Dataset for Question Answering over DBpedia and Wikidata Translated by Native Speakers [68.9964449363406]
We extend QALD-9, one of the most popular KGQA benchmarks, by introducing high-quality translations of its questions into 8 languages.
To the best of our knowledge, five of these languages (Armenian, Ukrainian, Lithuanian, Bashkir, and Belarusian) had never before been considered by the KGQA research community.
arXiv Detail & Related papers (2022-01-31T22:19:55Z)
- Cross-Lingual GenQA: A Language-Agnostic Generative Question Answering Approach for Open-Domain Question Answering [76.99585451345702]
Open-Retrieval Generative Question Answering (GenQA) is proven to deliver high-quality, natural-sounding answers in English.
We present the first generalization of the GenQA approach for the multilingual environment.
arXiv Detail & Related papers (2021-10-14T04:36:29Z)
- Neural Coreference Resolution for Arabic [12.986359659930146]
We introduce a coreference resolution system for Arabic based on Lee et al.'s end-to-end architecture combined with the Arabic version of BERT and an external mention detector.
As far as we know, this is the first neural coreference resolution system aimed specifically at Arabic.
It substantially outperforms the existing state of the art on OntoNotes 5.0, with a gain of 15.2 CoNLL F1 points.
arXiv Detail & Related papers (2020-10-31T14:34:43Z)
- Efficient One-Pass End-to-End Entity Linking for Questions [48.776127715663826]
We present ELQ, a fast end-to-end entity linking model for questions.
ELQ uses a biencoder to jointly perform mention detection and linking in one pass.
With a very fast inference time (1.57 examples/s on a single CPU), ELQ can be useful for downstream question answering systems.
arXiv Detail & Related papers (2020-10-06T01:14:10Z)
- Quantum Büchi Automata [4.998632546280976]
We introduce the classes of $\omega$-languages recognized by QBAs in probable, almost sure, strict and non-strict threshold semantics.
We show that there are surprisingly only at most four substantially different classes of $\omega$-languages recognized by QBAs (out of uncountably infinite).
The relationship between classical $\omega$-languages and QBAs is clarified using our pumping lemmas.
arXiv Detail & Related papers (2018-04-24T12:23:49Z)