A Comprehensive Part-of-Speech Tagging to Standardize Central-Kurdish Language: A Research Guide for Kurdish Natural Language Processing Tasks
- URL: http://arxiv.org/abs/2504.19645v1
- Date: Mon, 28 Apr 2025 10:02:11 GMT
- Title: A Comprehensive Part-of-Speech Tagging to Standardize Central-Kurdish Language: A Research Guide for Kurdish Natural Language Processing Tasks
- Authors: Shadan Shukr Sabr, Nazira Sabr Mustafa, Talar Sabah Omar, Salah Hwayyiz Rasool, Nawzad Anwer Omer, Darya Sabir Hamad, Hemin Abdulhameed Shams, Omer Mahmood Kareem, Rozhan Noori Abdullah, Khabat Atar Abdullah, Mahabad Azad Mohammad, Haneen Al-Raghefy, Safar M. Asaad, Sara Jamal Mohammed, Twana Saeed Ali, Fazil Shawrow, Halgurd S. Maghdid,
- Abstract summary: Low-resourced languages, such as the Central-Kurdish language (CKL), mainly remain unexamined due to shortage of necessary resources to support their development.<n>This study presented an accurate and comprehensive POS tagset for the CKL to provide better performance of the Kurdish NLP tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: - The field of natural language processing (NLP) has dramatically expanded within the last decade. Many human-being applications are conducted daily via NLP tasks, starting from machine translation, speech recognition, text generation and recommendations, Part-of-Speech tagging (POS), and Named-Entity Recognition (NER). However, low-resourced languages, such as the Central-Kurdish language (CKL), mainly remain unexamined due to shortage of necessary resources to support their development. The POS tagging task is the base of other NLP tasks; for example, the POS tag set has been used to standardized languages to provide the relationship between words among the sentences, followed by machine translation and text recommendation. Specifically, for the CKL, most of the utilized or provided POS tagsets are neither standardized nor comprehensive. To this end, this study presented an accurate and comprehensive POS tagset for the CKL to provide better performance of the Kurdish NLP tasks. The article also collected most of the POS tags from different studies as well as from Kurdish linguistic experts to standardized part-of-speech tags. The proposed POS tagset is designed to annotate a large CKL corpus and support Kurdish NLP tasks. The initial investigations of this study via comparison with the Universal Dependencies framework for standard languages, show that the proposed POS tagset can streamline or correct sentences more accurately for Kurdish NLP tasks.
Related papers
- Zero-Shot Keyphrase Generation: Investigating Specialized Instructions and Multi-Sample Aggregation on Large Language Models [52.829293635314194]
Keyphrase generation is a long-standing NLP task for automatically generating keyphrases for a given document.
We focus on the zero-shot capabilities of open-source instruction-tuned LLMs (Phi-3, Llama-3) and the closed-source GPT-4o for this task.
arXiv Detail & Related papers (2025-03-01T19:38:57Z) - A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.<n>This survey delves into an important attribute of these datasets: the dialect of a language.<n>Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - MUG: A General Meeting Understanding and Generation Benchmark [60.09540662936726]
We build the AliMeeting4MUG Corpus, which consists of 654 recorded Mandarin meeting sessions with diverse topic coverage.
In this paper, we provide a detailed introduction of this corpus, SLP tasks and evaluation methods, baseline systems and their performance.
arXiv Detail & Related papers (2023-03-24T11:52:25Z) - Part-of-Speech Tagging of Odia Language Using statistical and Deep
Learning-Based Approaches [0.0]
This research work is to present a conditional random field (CRF) and deep learning-based approaches (CNN and Bi-LSTM) to develop Odia part-of-speech tagger.
It has been observed that Bi-LSTM model with character sequence feature and pre-trained word vector achieved a significant state-of-the-art result.
arXiv Detail & Related papers (2022-07-07T12:15:23Z) - Part of Speech Tagging (POST) of a Low-resource Language using another
Language (Developing a POS-Tagged Lexicon for Kurdish (Sorani) using a Tagged
Persian (Farsi) Corpus) [0.76146285961466]
Part of Speech Tagging (POST) is essential in developing tagged corpora.
The Kurdish language currently lacks publicly available tagged corpora of proper sizes.
We use a tagged corpus (Bijankhan corpus) in Persian (Farsi) as a close language to Kurdish to develop a POS-tagged lexicon.
arXiv Detail & Related papers (2022-01-30T11:49:43Z) - Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction [21.67895423776014]
We consider POS tagging within the framework of set-valued prediction.
We find that extending state-of-the-art POS taggers to set-valued prediction yields more precise and robust taggings.
arXiv Detail & Related papers (2020-08-04T07:21:36Z) - Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing these concerns.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using non-speaker'' (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z) - Weakly Supervised POS Taggers Perform Poorly on Truly Low-Resource
Languages [20.6788155478186]
We train and evaluate state-of-the-art weakly supervised POS taggers for a typologically diverse set of 15 truly low-resource languages.
Our results highlight the need for new and different approaches to POS tagging for truly low-resource languages.
arXiv Detail & Related papers (2020-04-28T05:14:08Z) - Machine Learning Approaches for Amharic Parts-of-speech Tagging [0.0]
Performance of the current POS taggers in Amharic is not as good as that of the contemporary POS taggers available for English and other European languages.
The aim of this work is to improve POS tagging performance for the Amharic language, which was never above 91%.
arXiv Detail & Related papers (2020-01-10T06:40:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.