A Dataset of Kurdish (Sorani) Named Entities -- An Amendment to
Kurdish-BLARK Named Entities
- URL: http://arxiv.org/abs/2301.04962v1
- Date: Thu, 12 Jan 2023 12:13:44 GMT
- Authors: Sazan Salar and Hossein Hassani
- Abstract summary: We present a dataset that covers several categories of NEs in Kurdish (Sorani).
The dataset is a significant amendment to a previously developed dataset in the Kurdish BLARK (Basic Language Resource Kit).
- Score: 0.76146285961466
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Named Entity Recognition (NER) is one of the essential applications of
Natural Language Processing (NLP). It is also an instrument that plays a
significant role in many other NLP applications, such as Machine Translation
(MT), Information Retrieval (IR), and Part of Speech Tagging (POST). Kurdish is
an under-resourced language from the NLP perspective. Particularly, in all the
categories, the lack of NER resources hinders other aspects of Kurdish
processing. In this work, we present a data set that covers several categories
of NEs in Kurdish (Sorani). The dataset is a significant amendment to a
previously developed dataset in the Kurdish BLARK (Basic Language Resource
Kit). It covers 11 categories and 33261 entries in total. The dataset is
publicly available for non-commercial use under the CC BY-NC-SA 4.0 license at
https://kurdishblark.github.io/.
Related papers
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), for extractive question answering (EQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages [44.017657230247934]
We present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages.
These languages originate from five distinct language families and are predominantly spoken in Africa and Asia.
Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences.
arXiv Detail & Related papers (2024-02-13T18:04:53Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z)
- Mukayese: Turkish NLP Strikes Back [0.19116784879310023]
We demonstrate that languages such as Turkish are left behind the state-of-the-art in NLP applications.
We present Mukayese, a set of NLP benchmarks for the Turkish language.
We present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking.
arXiv Detail & Related papers (2022-03-02T16:18:44Z)
- Part of Speech Tagging (POST) of a Low-resource Language using another
Language (Developing a POS-Tagged Lexicon for Kurdish (Sorani) using a Tagged
Persian (Farsi) Corpus) [0.76146285961466]
Part of Speech Tagging (POST) is essential in developing tagged corpora.
The Kurdish language currently lacks publicly available tagged corpora of proper sizes.
We use a tagged corpus (Bijankhan corpus) in Persian (Farsi) as a close language to Kurdish to develop a POS-tagged lexicon.
arXiv Detail & Related papers (2022-01-30T11:49:43Z)
- Masader: Metadata Sourcing for Arabic Text and Speech Data Resources [3.345437353879255]
Masader is the largest public catalogue for Arabic NLP datasets.
We develop a metadata annotation strategy that could be extended to other languages.
arXiv Detail & Related papers (2021-10-13T14:25:21Z)
- Data and Representation for Turkish Natural Language Inference [6.135815931215188]
We offer a positive response for natural language inference (NLI) in Turkish.
We translate two large English NLI datasets into Turkish and have a team of experts validate their translation quality and fidelity to the original labels.
We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large.
arXiv Detail & Related papers (2020-04-30T17:12:52Z)
- Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments
on Kurdish (Sorani) Texts [0.76146285961466]
Punkt is an unsupervised machine learning method.
We used Punkt to segment a Kurdish corpus of Sorani dialect, written in Persian-Arabic script.
In our experiment, we achieved an F1 score of 91.10% and had an Error Rate of 16.32%.
arXiv Detail & Related papers (2020-04-09T06:44:08Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.