KazNERD: Kazakh Named Entity Recognition Dataset
- URL: http://arxiv.org/abs/2111.13419v1
- Date: Fri, 26 Nov 2021 10:56:19 GMT
- Title: KazNERD: Kazakh Named Entity Recognition Dataset
- Authors: Rustem Yeshpanov, Yerbolat Khassanov, Huseyin Atakan Varol
- Abstract summary: We present the development of a dataset for Kazakh named entity recognition.
The dataset was built as there is a clear need for publicly available annotated corpora in Kazakh.
The resulting dataset contains 112,702 sentences and 136,333 annotations for 25 entity classes.
- Score: 5.094176584161206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the development of a dataset for Kazakh named entity recognition.
The dataset was built as there is a clear need for publicly available annotated
corpora in Kazakh, as well as annotation guidelines containing
straightforward--but rigorous--rules and examples. The dataset annotation,
based on the IOB2 scheme, was carried out on television news text by two native
Kazakh speakers under the supervision of the first author. The resulting
dataset contains 112,702 sentences and 136,333 annotations for 25 entity
classes. State-of-the-art machine learning models to automatise Kazakh named
entity recognition were also built, with the best-performing model achieving an
exact match F1-score of 97.22% on the test set. The annotated dataset,
guidelines, and code used to train the models are freely available for
download under the CC BY 4.0 licence from https://github.com/IS2AI/KazNERD.
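The abstract names two technical ingredients: IOB2 tagging and an exact match F1-score. The sketch below is a minimal illustration of both and is not the authors' evaluation code. The helper names `iob2_spans` and `exact_match_f1` are hypothetical, and the tags in the usage note are placeholders. It assumes strict IOB2, in which every entity begins with a `B-` tag, so an `I-` tag that does not continue an open entity of the same class is ignored.

```python
from typing import List, Set, Tuple

# (entity class, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def iob2_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one sentence tagged in strict IOB2."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes an entity that runs to the sentence end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        else:
            # "O" or a stray "I-" tag: no entity is open (assumption: strict IOB2).
            start, label = None, None
    return spans


def exact_match_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Entity-level F1: a prediction counts only if class and both
    boundaries match a gold entity exactly."""
    gold_spans, pred_spans = set(), set()
    for sent_id, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= {(sent_id, *s) for s in iob2_spans(g)}
        pred_spans |= {(sent_id, *s) for s in iob2_spans(p)}
    true_pos = len(gold_spans & pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_spans)
    recall = true_pos / len(gold_spans)
    return 2 * precision * recall / (precision + recall)
```

For example, with gold tags `["B-PERSON", "I-PERSON", "O", "B-GPE"]` and predicted tags `["B-PERSON", "I-PERSON", "O", "O"]`, one of the two gold entities is recovered exactly, so precision is 1, recall is 0.5, and the F1-score is 2/3.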
Related papers
- KazQAD: Kazakh Open-Domain Question Answering Dataset [2.8158674707210136]
KazQAD is a Kazakh open-domain question answering dataset.
It can be used in reading comprehension and full ODQA settings.
It contains just under 6,000 unique questions with extracted short answers.
arXiv Detail & Related papers (2024-04-06T03:40:36Z)
- KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes [3.4975081145096665]
KazSAnDRA comprises an extensive collection of 180,064 reviews obtained from various sources and includes numerical ratings ranging from 1 to 5.
The study also pursued the automation of Kazakh sentiment classification through the development and evaluation of four machine learning models.
arXiv Detail & Related papers (2024-03-28T11:51:11Z)
- Pseudo-label Alignment for Semi-supervised Instance Segmentation [67.9616087910363]
Pseudo-labeling is significant for semi-supervised instance segmentation.
In existing pipelines, pseudo-labels that contain valuable information may be filtered out due to mismatches in class and mask quality.
We propose a novel framework, called pseudo-label aligning instance segmentation (PAIS), in this paper.
arXiv Detail & Related papers (2023-08-10T05:56:53Z)
- Slovo: Russian Sign Language Dataset [83.93252084624997]
This paper presents the Russian Sign Language (RSL) video dataset Slovo, produced using crowdsourcing platforms.
The dataset contains 20,000 FullHD recordings, divided into 1,000 classes of isolated RSL gestures received by 194 signers.
arXiv Detail & Related papers (2023-05-23T21:00:42Z)
- Navya3DSeg -- Navya 3D Semantic Segmentation Dataset & split generation for autonomous vehicles [63.20765930558542]
3D semantic data are useful for core perception tasks such as obstacle detection and ego-vehicle localization.
We propose a new dataset, Navya 3D (Navya3DSeg), with a diverse label space corresponding to a large scale production grade operational domain.
It contains 23 labeled sequences and 25 supplementary sequences without labels, designed to explore self-supervised and semi-supervised semantic segmentation benchmarks on point clouds.
arXiv Detail & Related papers (2023-02-16T13:41:19Z)
- Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT [1.2891210250935146]
Wojood consists of 550K Modern Standard Arabic (MSA) and dialect tokens that are manually annotated with 21 entity types.
The data contains about 75K entities, 22.5% of which are nested.
Our corpus, the annotation guidelines, the source code and the pre-trained model are publicly available.
arXiv Detail & Related papers (2022-05-19T16:06:49Z)
- Label Semantics for Few Shot Named Entity Recognition [68.01364012546402]
We study the problem of few shot learning for named entity recognition.
We leverage the semantic information in the names of the labels as a way of giving the model additional signal and enriched priors.
Our model learns to match the representations of named entities computed by the first encoder with label representations computed by the second encoder.
arXiv Detail & Related papers (2022-03-16T23:21:05Z)
- MobIE: A German Dataset for Named Entity Recognition, Entity Linking and Relation Extraction in the Mobility Domain [76.21775236904185]
The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities.
A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types.
To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE.
arXiv Detail & Related papers (2021-08-16T08:21:50Z)
- KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset [4.542831770689362]
This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide.
The dataset consists of about 91 hours of transcribed audio recordings spoken by two professional speakers.
It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech applications in both academia and industry.
arXiv Detail & Related papers (2021-04-17T05:49:57Z)
- Weakly-Supervised Salient Object Detection via Scribble Annotations [54.40518383782725]
We propose a weakly-supervised salient object detection model to learn saliency from scribble labels.
We present a new metric, termed saliency structure measure, to measure the structure alignment of the predicted saliency maps.
Our method not only outperforms existing weakly-supervised/unsupervised methods, but also is on par with several fully-supervised state-of-the-art models.
arXiv Detail & Related papers (2020-03-17T12:59:50Z)