Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition
- URL: http://arxiv.org/abs/2506.12615v1
- Date: Sat, 14 Jun 2025 19:58:55 GMT
- Title: Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition
- Authors: Nagham Hamad, Mohammed Khalilia, Mustafa Jarrar
- Abstract summary: Konooz is a novel multi-dimensional corpus covering 16 Arabic dialects across 10 domains. Konooz is useful for various NLP tasks like domain adaptation and transfer learning.
- Score: 0.5461938536945721
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce Konooz, a novel multi-dimensional corpus covering 16 Arabic dialects across 10 domains, resulting in 160 distinct corpora. The corpus comprises about 777k tokens, carefully collected and manually annotated with 21 entity types using both nested and flat annotation schemes, following the Wojood guidelines. While Konooz is useful for various NLP tasks like domain adaptation and transfer learning, this paper primarily focuses on benchmarking existing Arabic Named Entity Recognition (NER) models, especially cross-domain and cross-dialect model performance. Our benchmarking of four Arabic NER models using Konooz reveals a significant drop in performance of up to 38% when compared to the in-distribution data. Furthermore, we present an in-depth analysis of domain and dialect divergence and the impact of resource scarcity. We also measured the overlap between domains and dialects using the Maximum Mean Discrepancy (MMD) metric, and illustrated why certain NER models perform better on specific dialects and domains. Konooz is open-source and publicly available at https://sina.birzeit.edu/wojood/#download
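The abstract reports using the Maximum Mean Discrepancy (MMD) metric to measure overlap between domains and dialects. As a rough illustration of how such a comparison works, below is a minimal sketch of the (biased) squared-MMD estimate with a Gaussian kernel, assuming each domain or dialect is represented as a set of fixed-size embedding vectors; the paper's actual kernel choice and feature representation are not specified in this summary.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between row-vector samples x and y."""
    # squared Euclidean distances between all pairs of rows
    d2 = np.sum(x**2, axis=1)[:, None] + np.sum(y**2, axis=1)[None, :] - 2.0 * x @ y.T
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between the samples x and y.

    A value near zero suggests the two samples come from similar
    distributions; larger values indicate greater divergence.
    """
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()
```

Intuitively, two corpora drawn from the same distribution yield a squared MMD close to zero, while a domain or dialect shift pushes the value up, which is how overlap between Konooz's domains and dialects could be quantified.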
Related papers
- Multi-Sense Embeddings for Language Models and Knowledge Distillation [17.559171180573664]
Transformer-based large language models (LLMs) rely on contextual embeddings which generate different representations for the same token depending on its surrounding context. We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language.
arXiv Detail & Related papers (2025-04-08T13:36:36Z) - mucAI at WojoodNER 2024: Arabic Named Entity Recognition with Nearest Neighbor Search [0.0]
We introduce Arabic KNN-NER, our submission to the Wojood NER Shared Task 2024 (ArabicNLP 2024).
In this paper, we tackle fine-grained flat-entity recognition for Arabic text, where we identify a single main entity and possibly zero or multiple sub-entities for each word.
Our submission achieved 91% on the test set on the WojoodFine dataset, placing Arabic KNN-NER on top of the leaderboard for the shared task.
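The mucAI entry describes NER with nearest neighbor search, but this summary does not detail the exact method. Below is a hedged sketch of the general kNN-augmented classification idea (in the spirit of kNN retrieval over a datastore of hidden states, as in kNN-LM-style methods), where a base classifier's label distribution is interpolated with a distribution built from retrieved neighbors. All function names and parameters here are illustrative, not taken from the paper.

```python
import numpy as np

def knn_interpolate(base_probs, query, datastore_keys, datastore_labels,
                    num_labels, k=8, lam=0.5, temperature=10.0):
    """Blend a base classifier's label distribution with a kNN distribution.

    base_probs       : softmax output of the base NER model for one token
    query            : hidden-state vector for that token
    datastore_keys   : hidden states of training tokens (one row each)
    datastore_labels : integer entity label for each datastore row
    lam              : interpolation weight for the kNN distribution
    """
    # distances from the query to every datastore entry
    dists = np.linalg.norm(datastore_keys - query, axis=1)
    neighbors = np.argsort(dists)[:k]

    # turn neighbor distances into weights and accumulate per label
    weights = np.exp(-dists[neighbors] / temperature)
    knn_probs = np.zeros(num_labels)
    for idx, w in zip(neighbors, weights):
        knn_probs[datastore_labels[idx]] += w
    knn_probs /= knn_probs.sum()

    # interpolate the two distributions
    return lam * knn_probs + (1.0 - lam) * base_probs
```

The appeal of this family of methods is that the retrieval component requires no retraining: the base model's predictions are corrected at inference time by evidence from labeled examples whose representations lie near the query.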
arXiv Detail & Related papers (2024-08-07T09:34:55Z) - Can Humans Identify Domains? [17.579694463517363]
Textual domain is a crucial property within the Natural Language Processing (NLP) community due to its effects on downstream model performance.
We investigate the core notion of domains via human proficiency in identifying related intrinsic textual properties.
We find that despite the ubiquity of domains in NLP, there is little human consensus on how to define them.
arXiv Detail & Related papers (2024-04-02T09:49:07Z) - SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages [44.017657230247934]
We present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages.
These languages originate from five distinct language families and are predominantly spoken in Africa and Asia.
Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences.
arXiv Detail & Related papers (2024-02-13T18:04:53Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - SMDT: Selective Memory-Augmented Neural Document Translation [53.4627288890316]
We propose a Selective Memory-augmented Neural Document Translation model to deal with documents containing large hypothesis space of context.
We retrieve similar bilingual sentence pairs from the training corpus to augment global context.
We extend the two-stream attention model with selective mechanism to capture local context and diverse global contexts.
arXiv Detail & Related papers (2022-01-05T14:23:30Z) - MobIE: A German Dataset for Named Entity Recognition, Entity Linking and
Relation Extraction in the Mobility Domain [76.21775236904185]
The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities.
A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types.
To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE.
arXiv Detail & Related papers (2021-08-16T08:21:50Z) - DaN+: Danish Nested Named Entities and Lexical Normalization [18.755176247223616]
This paper introduces DaN+, a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization.
We empirically assess three strategies to model the two-layer Named Entity Recognition (NER) task.
Our results show that 1) the most robust strategy is multi-task learning which is rivaled by multi-label decoding, 2) BERT-based NER models are sensitive to domain shifts, and 3) in-language BERT and lexical normalization are the most beneficial on the least canonical data.
arXiv Detail & Related papers (2021-05-24T14:35:21Z) - LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for Multi-Granular Propaganda Span Identification [70.1903083747775]
This paper describes our submission for the task of Propaganda Span Identification in news articles.
We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda.
arXiv Detail & Related papers (2020-08-11T16:14:47Z) - Zero-Resource Cross-Domain Named Entity Recognition [68.83177074227598]
Existing models for cross-domain named entity recognition rely on large amounts of unlabeled corpora or labeled NER training data in target domains.
We propose a cross-domain NER model that does not use any external resources.
arXiv Detail & Related papers (2020-02-14T09:04:18Z) - Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.