Named Entity Recognition for the Kurdish Sorani Language: Dataset Creation and Comparative Analysis
- URL: http://arxiv.org/abs/2511.22315v1
- Date: Thu, 27 Nov 2025 10:46:11 GMT
- Title: Named Entity Recognition for the Kurdish Sorani Language: Dataset Creation and Comparative Analysis
- Authors: Bakhtawar Abdalla, Rebwar Mala Nabi, Hassan Eshkiki, Fabio Caraffini
- Abstract summary: This work contributes towards balancing the inclusivity and global applicability of natural language processing techniques. It proposes the first named entity recognition dataset for Kurdish Sorani, a low-resource and under-represented language.
- Score: 1.0499611180329804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work contributes towards balancing the inclusivity and global applicability of natural language processing techniques by proposing the first named entity recognition (NER) dataset for Kurdish Sorani, a low-resource and under-represented language, consisting of 64,563 annotated tokens. It also provides a tool that facilitates this task in this and many other languages, and it performs a thorough comparative analysis covering both classic machine learning models and neural systems. The results challenge established assumptions about the advantage of neural approaches in NLP: conventional methods, in particular conditional random fields (CRF), obtain an F1-score of 0.825, significantly outperforming BiLSTM-based models (0.706). These findings indicate that simpler and more computationally efficient classical frameworks can outperform neural architectures in low-resource settings.
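As a rough illustration of the winning classical approach, below is a minimal sketch of a CRF sequence tagger built with sklearn-crfsuite; the feature set, hyperparameters, and the toy Sorani sentence are assumptions, not the authors' exact pipeline.

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def token_features(sent, i):
    # Surface-level features for one token; a hypothetical feature set.
    word = sent[i][0]
    feats = {
        "word": word,
        "is_first": i == 0,
        "is_last": i == len(sent) - 1,
        "prefix2": word[:2],
        "suffix2": word[-2:],
    }
    if i > 0:
        feats["prev_word"] = sent[i - 1][0]
    if i < len(sent) - 1:
        feats["next_word"] = sent[i + 1][0]
    return feats

def featurize(sents):
    X = [[token_features(s, i) for i in range(len(s))] for s in sents]
    y = [[label for _, label in s] for s in sents]
    return X, y

# Each sentence is a list of (token, BIO tag) pairs drawn from the annotated corpus.
train_sents = [[("هەولێر", "B-LOC"), ("شارێکە", "O")]]  # toy placeholder sentence
X_train, y_train = featurize(train_sents)

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(metrics.flat_f1_score(y_train, crf.predict(X_train), average="weighted"))
```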
Related papers
- HausaMovieReview: A Benchmark Dataset for Sentiment Analysis in Low-Resource African Language [1.3465808629549525]
This paper introduces a novel benchmark dataset comprising 5,000 YouTube comments in Hausa and code-switched English. We use this dataset to conduct a comparative analysis of classical models and fine-tuned transformer models. Our results reveal a key finding: the Decision Tree classifier, with an accuracy of 89.72% and an F1-score of 89.60%, significantly outperformed the deep learning models.
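For orientation, a minimal sketch of such a classical pipeline follows: a TF-IDF bag-of-words Decision Tree in scikit-learn. The feature choices, depth limit, and placeholder Hausa text are assumptions, not the paper's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy comments standing in for the 5,000-comment YouTube corpus.
comments = ["fim din yayi kyau sosai", "ban son wannan fim ba"]
labels = ["positive", "negative"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # word + bigram TF-IDF features (assumed)
    DecisionTreeClassifier(max_depth=50),  # depth limit is an assumption
)
clf.fit(comments, labels)
print(clf.predict(["fim mai kyau"]))
```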
arXiv Detail & Related papers (2025-09-17T22:57:21Z)
- NER-RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within low-resource languages [3.5403652483328223]
This work proposes a methodology for fine-tuning the pre-trained RoBERTa model for Kurdish NER (KNER). Experiments show that fine-tuned RoBERTa with the SentencePiece tokenization method substantially improves KNER performance.
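A minimal sketch of this kind of fine-tuning with Hugging Face Transformers follows; the xlm-roberta-base checkpoint (a SentencePiece-based RoBERTa variant), the label set, and the toy sentence are assumptions rather than the paper's exact setup.

```python
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels))

def encode(example):
    # Tokenize pre-split words and align BIO tags to the first sub-token only,
    # masking the rest with -100 (a common alignment recipe, assumed here).
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for wid in enc.word_ids():
        aligned.append(-100 if wid is None or wid == prev else example["tags"][wid])
        prev = wid
    enc["labels"] = aligned
    return enc

# Toy two-token sentence standing in for a real annotated corpus (3 = "B-LOC").
data = Dataset.from_dict({"tokens": [["هەولێر", "شارێکە"]], "tags": [[3, 0]]})
data = data.map(encode, remove_columns=["tokens", "tags"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="kner", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=data,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```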
arXiv Detail & Related papers (2024-12-15T07:07:17Z)
- Transformer-Based Contextualized Language Models Joint with Neural Networks for Natural Language Inference in Vietnamese [1.7457686843484872]
We conduct experiments using various combinations of contextualized language models (CLM) and neural networks.
We find that the joint approach of CLM and neural networks is simple yet capable of achieving high-quality performance.
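The sketch below shows one plausible instance of this pattern: a frozen multilingual encoder feeding a small BiLSTM classification head in PyTorch. The checkpoint, head design, and label space are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

class CLMBiLSTM(nn.Module):
    def __init__(self, hidden=256, n_labels=3):  # entailment / neutral / contradiction
        super().__init__()
        self.lstm = nn.LSTM(encoder.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_labels)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():  # keep the CLM frozen in this sketch
            states = encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        out, _ = self.lstm(states)
        return self.head(out[:, 0])  # classify from the first position

# Premise/hypothesis pair encoded jointly, as is typical for NLI.
batch = tok(["Premise in Vietnamese"], ["Hypothesis in Vietnamese"],
            return_tensors="pt", padding=True)
logits = CLMBiLSTM()(batch["input_ids"], batch["attention_mask"])
```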
arXiv Detail & Related papers (2024-11-20T15:46:48Z)
- TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale [66.01943465390548]
We introduce TriSum, a framework for distilling large language models' text summarization abilities into a compact, local model.
Our method enhances local model performance on various benchmarks.
It also improves interpretability by providing insights into the summarization rationale.
arXiv Detail & Related papers (2024-03-15T14:36:38Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST). We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- A Sequence-to-Sequence Approach for Arabic Pronoun Resolution [0.0]
This paper proposes a sequence-to-sequence learning approach for Arabic pronoun resolution.
The proposed approach is evaluated on the AnATAr dataset.
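As a purely speculative sketch, pronoun resolution can be framed as text-to-text generation as below; the mT5 checkpoint, the "resolve:" prefix, and the bracket marking are illustrative assumptions, not the paper's method.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# The pronoun to resolve is marked with brackets; fine-tuning on annotated
# pairs would teach the model to emit the antecedent (here, "أحمد").
source = "resolve: سافر أحمد إلى بغداد لأنه [هو] يعمل هناك"
batch = tok(source, return_tensors="pt")
ids = model.generate(**batch, max_new_tokens=8)
print(tok.decode(ids[0], skip_special_tokens=True))
```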
arXiv Detail & Related papers (2023-05-19T08:53:41Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
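Schematically, that pipeline reads like the sketch below, where every function is a hypothetical stub standing in for a trained component (translator, English NER tagger, labeled sequence translation model); the marker format is also an assumption.

```python
def crop_project(target_sentence, translate, english_ner, labeled_translate):
    english = translate(target_sentence)  # 1. raw target sentence -> English
    tagged = english_ner(english)         # 2. tag entities with an English NER model
    back = labeled_translate(tagged)      # 3. translate the *tagged* sequence back;
                                          #    markers survive, locating target spans
    return back                           # 4. aligned markers yield silver labels

# Toy stubs so the sketch runs end-to-end; CROP uses trained models at each step.
translate = lambda s: "I visited Erbil"
english_ner = lambda s: "I visited __LOC__ Erbil __LOC__"
labeled_translate = lambda s: "min serdanî __LOC__ Hewlêr __LOC__ kird"
print(crop_project("min serdanî Hewlêr kird", translate, english_ner, labeled_translate))
```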
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR). Specifically, we propose to inject standard Gaussian noise and regularize the hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
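A minimal PyTorch sketch of that idea, perturbing a hidden representation with Gaussian noise and penalizing output drift, is given below; the toy network, noise scale, and loss weighting are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerNet(nn.Module):
    def __init__(self, d_in=768, d_hidden=768, n_labels=2):
        super().__init__()
        self.layer1 = nn.Linear(d_in, d_hidden)
        self.layer2 = nn.Linear(d_hidden, n_labels)

    def forward(self, x, noise_std=0.0):
        h = torch.tanh(self.layer1(x))
        if noise_std > 0:
            h = h + noise_std * torch.randn_like(h)  # inject standard Gaussian noise
        return self.layer2(h)

model = TwoLayerNet()
x = torch.randn(4, 768)           # stand-in for pooled encoder features
y = torch.randint(0, 2, (4,))

clean = model(x)
noisy = model(x, noise_std=0.1)
task_loss = F.cross_entropy(clean, y)
stability = F.mse_loss(noisy, clean.detach())  # output should be stable under noise
loss = task_loss + 0.1 * stability             # 0.1 weighting is an assumption
loss.backward()
```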
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
- An Attention Ensemble Approach for Efficient Text Classification of Indian Languages [0.0]
This paper focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language.
A hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification.
Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models on the given task, achieving the best validation accuracy of 89.57% and an F1-score of 0.8875.
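The sketch below gives one plausible PyTorch reading of such a hybrid: CNN and BiLSTM branches over shared embeddings, fused and pooled with additive attention. All dimensions and the fusion scheme are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class CNNBiLSTMAttention(nn.Module):
    def __init__(self, vocab=20000, emb=128, hidden=128, n_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, 2 * hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, ids):
        e = self.embed(ids)                               # (B, T, emb)
        conv = torch.relu(self.conv(e.transpose(1, 2))).transpose(1, 2)
        lstm, _ = self.lstm(e)                            # (B, T, 2*hidden)
        mixed = conv + lstm                               # ensemble the two branches
        weights = torch.softmax(self.attn(mixed), dim=1)  # attention over time steps
        return self.out((weights * mixed).sum(dim=1))

logits = CNNBiLSTMAttention()(torch.randint(0, 20000, (2, 40)))
```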
arXiv Detail & Related papers (2021-02-20T07:31:38Z)
- Coreferential Reasoning Learning for Language Representation [88.14248323659267]
We present CorefBERT, a novel language representation model that can capture the coreferential relations in context.
The experimental results show that CorefBERT achieves consistent and significant improvements over existing baseline models on various downstream NLP tasks.
arXiv Detail & Related papers (2020-04-15T03:57:45Z)