MURAD: A Large-Scale Multi-Domain Unified Reverse Arabic Dictionary Dataset
- URL: http://arxiv.org/abs/2601.21512v1
- Date: Thu, 29 Jan 2026 10:28:01 GMT
- Title: MURAD: A Large-Scale Multi-Domain Unified Reverse Arabic Dictionary Dataset
- Authors: Serry Sibaee, Yasser Alhabashi, Nadia Sibai, Yara Farouk, Adel Ammar, Sawsan AlHalawani, Wadii Boulila
- Abstract summary: MURAD (Multi-domain Unified Reverse Arabic Dictionary) is an open lexical dataset with 96,243 word-definition pairs. The dataset covers terms from linguistics, Islamic studies, mathematics, physics, psychology, and engineering. It supports computational linguistics and lexicographic research.
- Score: 4.621023604328769
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Arabic is a linguistically and culturally rich language with a vast vocabulary that spans scientific, religious, and literary domains. Yet, large-scale lexical datasets linking Arabic words to precise definitions remain limited. We present MURAD (Multi-domain Unified Reverse Arabic Dictionary), an open lexical dataset with 96,243 word-definition pairs. The data come from trusted reference works and educational sources. Extraction used a hybrid pipeline integrating direct text parsing, optical character recognition, and automated reconstruction. This ensures accuracy and clarity. Each record aligns a target word with its standardized Arabic definition and metadata that identifies the source domain. The dataset covers terms from linguistics, Islamic studies, mathematics, physics, psychology, and engineering. It supports computational linguistics and lexicographic research. Applications include reverse dictionary modeling, semantic retrieval, and educational tools. By releasing this resource, we aim to advance Arabic natural language processing and promote reproducible research on Arabic lexical semantics.
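The record layout described in the abstract (a target word, its Arabic definition, and a source-domain label) and the reverse-dictionary use case can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the field names `word`, `definition`, and `domain`, the three toy entries (invented, not taken from MURAD), and the token-overlap ranking, which stands in for a real retrieval model and is not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    # Hypothetical field names; the released dataset's schema may differ.
    word: str
    definition: str
    domain: str


# Invented toy entries for illustration only (not drawn from MURAD).
TOY_ENTRIES = [
    Entry("مثلث", "شكل هندسي له ثلاثة أضلاع", "mathematics"),
    Entry("سرعة", "المسافة المقطوعة في وحدة الزمن", "physics"),
    Entry("ذاكرة", "قدرة العقل على حفظ المعلومات واسترجاعها", "psychology"),
]


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reverse_lookup(query, entries, top_k=1, domain=None):
    """Rank entries by whitespace-token overlap between query and definition.

    This is a naive baseline: a description goes in, candidate words come out.
    The optional `domain` argument restricts the search to one source domain.
    """
    query_tokens = set(query.split())
    pool = [e for e in entries if domain is None or e.domain == domain]
    ranked = sorted(
        pool,
        key=lambda e: jaccard(query_tokens, set(e.definition.split())),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    best = reverse_lookup("شكل له ثلاثة أضلاع", TOY_ENTRIES)[0]
    print(best.word, best.domain)
```

Whitespace tokenization ignores Arabic morphology and diacritics, so a practical system would likely normalize the text and use learned embeddings instead; the sketch only shows the direction of the mapping, from definition to word.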
Related papers
- WikiTermBase: An AI-Augmented Term Base to Standardize Arabic Translation on Wikipedia [0.0]
This abstract introduces an open source tool, WikiTermBase, with a systematic approach for building a lexicographical database with over 900K terms. The tool was successfully implemented on Arabic Wikipedia to standardize translated English and French terms.
arXiv Detail & Related papers (2025-05-26T11:27:01Z) - Advancing Arabic Reverse Dictionary Systems: A Transformer-Based Approach with Dataset Construction Guidelines [0.8944616102795021]
This study addresses the critical gap in Arabic natural language processing by developing an effective Arabic Reverse Dictionary (RD) system. We present a novel transformer-based approach with a semi-encoder neural network architecture featuring geometrically decreasing layers. Our methodology incorporates a comprehensive dataset construction process and establishes formal quality standards for Arabic lexicographic definitions.
arXiv Detail & Related papers (2025-04-30T09:56:36Z) - Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [70.23624194206171]
This paper addresses the need for democratizing large language models (LLM) in the Arab world. One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding. Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z) - Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization [9.191117990275385]
The absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP).
This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild".
We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context.
arXiv Detail & Related papers (2024-06-09T12:29:55Z) - LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z) - LexGen: Domain-aware Multilingual Lexicon Generation [40.97738267067852]
We propose a new model to generate dictionary words for 6 Indian languages in the multi-domain setting. Our model consists of domain-specific and domain-generic layers that encode information, and these layers are invoked via a learnable routing technique. We release a new benchmark dataset consisting of >75K translation pairs across 6 Indian languages spanning 8 diverse domains.
arXiv Detail & Related papers (2024-05-18T07:02:43Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmenters in Hadith Domain [5.916745177895035]
In this paper, we present a standard dataset for analyzing the Arabic segmentation tools, which includes approximately 223,690 words from the "Shariat al-Islam" book. To estimate the dataset, we applied different methods, including Farasa, Camel, and ALP, and reported the annotation quality and analyzed the benchmark specifications as well.
arXiv Detail & Related papers (2023-06-22T16:50:40Z) - Taxonomy Enrichment with Text and Graph Vector Representations [61.814256012166794]
We address the problem of taxonomy enrichment which aims at adding new words to the existing taxonomy.
We present a new method that allows achieving high results on this task with little effort.
We achieve state-of-the-art results across different datasets and provide an in-depth error analysis of mistakes.
arXiv Detail & Related papers (2022-01-21T09:01:12Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.