Related papers: A Dataset of Kurdish (Sorani) Named Entities -- An Amendment to Kurdish-BLARK Named Entities

URL: http://arxiv.org/abs/2301.04962v1
Date: Thu, 12 Jan 2023 12:13:44 GMT
Title: A Dataset of Kurdish (Sorani) Named Entities -- An Amendment to Kurdish-BLARK Named Entities
Authors: Sazan Salar and Hossein Hassani
Abstract summary: We present a data set that covers several categories of NEs in Kurdish (Sorani). The dataset is a significant amendment to a previously developed dataset in the Kurdish BLARK (Basic Language Resource Kit).
Score: 0.76146285961466
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Named Entity Recognition (NER) is one of the essential applications of Natural Language Processing (NLP). It is also an instrument that plays a significant role in many other NLP applications, such as Machine Translation (MT), Information Retrieval (IR), and Part of Speech Tagging (POST). Kurdish is an under-resourced language from the NLP perspective. Particularly, in all the categories, the lack of NER resources hinders other aspects of Kurdish processing. In this work, we present a data set that covers several categories of NEs in Kurdish (Sorani). The dataset is a significant amendment to a previously developed dataset in the Kurdish BLARK (Basic Language Resource Kit). It covers 11 categories and 33261 entries in total. The dataset is publicly available for non-commercial use under CC BY-NC-SA 4.0 license at https://kurdishblark.github.io/.

Related papers

NER-RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within low-resource languages [3.5403652483328223]
This work proposes a methodology for fine-tuning the pre-trained RoBERTa model for Kurdish NER (KNER). Experiments show that fine-tuned RoBERTa with the SentencePiece tokenization method substantially improves KNER performance.
arXiv Detail & Related papers (2024-12-15T07:07:17Z)
Shifting from endangerment to rebirth in the Artificial Intelligence Age: An Ensemble Machine Learning Approach for Hawrami Text Classification [1.174020933567308]
Hawrami, a dialect of Kurdish, is classified as an endangered language. This paper introduces various text classification models using a dataset of 6,854 articles in Hawrami labeled into 15 categories by two native speakers.
arXiv Detail & Related papers (2024-09-25T12:52:21Z)
Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA). We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages [44.017657230247934]
We present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences.
arXiv Detail & Related papers (2024-02-13T18:04:53Z)
Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia. Most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
Mukayese: Turkish NLP Strikes Back [0.19116784879310023]
We demonstrate that languages such as Turkish are left behind the state-of-the-art in NLP applications. We present Mukayese, a set of NLP benchmarks for the Turkish language. We present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking.
arXiv Detail & Related papers (2022-03-02T16:18:44Z)
Part of Speech Tagging (POST) of a Low-resource Language using another Language (Developing a POS-Tagged Lexicon for Kurdish (Sorani) using a Tagged Persian (Farsi) Corpus) [0.76146285961466]
Part of Speech Tagging (POST) is essential in developing tagged corpora. The Kurdish language currently lacks publicly available tagged corpora of proper sizes. We use a tagged corpus (Bijankhan corpus) in Persian (Farsi) as a close language to Kurdish to develop a POS-tagged lexicon.
arXiv Detail & Related papers (2022-01-30T11:49:43Z)
Masader: Metadata Sourcing for Arabic Text and Speech Data Resources [3.345437353879255]
Masader is the largest public catalogue for Arabic NLP datasets. We develop a metadata annotation strategy that could be extended to other languages.
arXiv Detail & Related papers (2021-10-13T14:25:21Z)
Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments on Kurdish (Sorani) Texts [0.76146285961466]
Punkt is an unsupervised machine learning method. We used Punkt to segment a Kurdish corpus of Sorani dialect, written in Persian-Arabic script. In our experiment, we achieved an F1 score of 91.10% and had an Error Rate of 16.32%.
arXiv Detail & Related papers (2020-04-09T06:44:08Z)
CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English. It is diversified with over 11,000 speakers and over 60 accents. CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.