WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning
Experiments for Slovak Named Entity Recognition
- URL: http://arxiv.org/abs/2304.04026v1
- Date: Sat, 8 Apr 2023 14:37:52 GMT
- Title: WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning
Experiments for Slovak Named Entity Recognition
- Authors: Dávid Šuba, Marek Šuppa, Jozef Kubík, Endre Hamerlik and Martin Takáč
- Abstract summary: We introduce WikiGoldSK, the first sizable human-labelled Slovak NER dataset.
We benchmark it by evaluating state-of-the-art multilingual Pretrained Language Models.
We conduct few-shot experiments and show that training on a silver-standard dataset yields better results.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Named Entity Recognition (NER) is a fundamental NLP task with a wide
range of practical applications. The performance of state-of-the-art NER methods
depends on high-quality, manually annotated datasets, which still do not exist
for some languages. In this work we aim to remedy this situation in Slovak by
introducing WikiGoldSK, the first sizable human-labelled Slovak NER dataset. We
benchmark it by evaluating state-of-the-art multilingual Pretrained Language
Models and by comparing it to the existing silver-standard Slovak NER dataset.
We also conduct few-shot experiments and show that training on a silver-standard
dataset yields better results. To enable future work on Slovak NER, we release
the dataset, code, as well as the trained models publicly under permissive
licensing terms at https://github.com/NaiveNeuron/WikiGoldSK.
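As an illustration of what benchmarking a multilingual Pretrained Language Model on such a dataset involves, here is a minimal sketch of a single fine-tuning step with XLM-RoBERTa via Hugging Face Transformers. The CoNLL-style label set, the example sentence, and the first-subword label alignment are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: one fine-tuning step of a multilingual PLM on Slovak NER.
# The label set and example sentence are assumptions, not the paper's setup.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG",
          "B-MISC", "I-MISC"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(LABELS))

words = ["Dávid", "pracuje", "v", "Bratislave", "."]
tags = ["B-PER", "O", "O", "B-LOC", "O"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level tags to subword tokens: label the first subword of each
# word, mask the rest (and special tokens) with -100 so the loss skips them.
aligned, prev = [], None
for wid in enc.word_ids():
    aligned.append(-100 if wid is None or wid == prev
                   else LABELS.index(tags[wid]))
    prev = wid

loss = model(**enc, labels=torch.tensor([aligned])).loss
loss.backward()  # an optimizer step would follow in a real training loop
```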
Related papers
- NER-RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within low-resource languages [3.5403652483328223]
This work proposes a methodology for fine-tuning the pre-trained RoBERTa model for Kurdish NER (KNER).
Experiments show that fine-tuned RoBERTa with the SentencePiece tokenization method substantially improves KNER performance.
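As a minimal illustration of the SentencePiece side of such a pipeline (the corpus path, vocabulary size, and coverage setting below are assumptions, not the paper's configuration):

```python
# Hedged sketch: train a SentencePiece subword model on a raw-text corpus,
# on which a RoBERTa-style tokenizer can then be built.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # one sentence per line, raw text (assumed)
    model_prefix="kner_sp",
    vocab_size=8000,           # illustrative size, not the paper's value
    character_coverage=1.0,    # keep all characters of the script
)
sp = spm.SentencePieceProcessor(model_file="kner_sp.model")
print(sp.encode("an example sentence", out_type=str))  # subword pieces
```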
arXiv Detail & Related papers (2024-12-15T07:07:17Z) - NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z) - LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
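The core idea of dictionary-coverage-driven curation can be sketched as a greedy filter; the dictionary format and the substring matching rule below are illustrative simplifications, not LexMatcher's actual procedure.

```python
# Hedged sketch: greedily keep parallel sentence pairs that cover
# previously unseen (source word, dictionary translation) senses.
def select_by_coverage(pairs, dictionary):
    """pairs: list of (src, tgt) sentence strings;
    dictionary: {src_word: set of candidate translations}."""
    covered, kept = set(), []
    for src, tgt in pairs:
        new = {(w, t) for w in src.split()
               for t in dictionary.get(w, ())
               if t in tgt and (w, t) not in covered}
        if new:                 # the pair adds at least one new sense
            covered |= new
            kept.append((src, tgt))
    return kept

pairs = [("das Haus ist alt", "the house is old"),
         ("das Haus ist alt", "the house is old")]  # duplicate adds nothing
print(select_by_coverage(pairs, {"Haus": {"house"}}))
```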
arXiv Detail & Related papers (2024-06-03T15:30:36Z) - A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
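The cartography in the title refers to training-dynamics statistics in the style of dataset cartography; a hedged sketch of the two core quantities follows, with random placeholder dynamics standing in for a real training run.

```python
# Hedged sketch: dataset-cartography statistics used to order a curriculum.
# probs[e, i] is the model's probability of example i's gold label at
# epoch e; here it is random placeholder data.
import numpy as np

rng = np.random.default_rng(0)
probs = rng.random((5, 1000))          # 5 epochs x 1000 training examples

confidence = probs.mean(axis=0)        # high = consistently easy example
variability = probs.std(axis=0)        # high = ambiguous for the model
easy_first = np.argsort(-confidence)   # one possible curriculum ordering
```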
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - Learning to Rank Context for Named Entity Recognition Using a Synthetic Dataset [6.633914491587503]
We propose to generate a synthetic context retrieval training dataset using Alpaca.
Using this dataset, we train a neural context retriever based on a BERT model that is able to find relevant context for NER.
We show that our method outperforms several retrieval baselines for the NER task on an English literary dataset composed of the first chapter of 40 books.
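A hedged sketch of what ranking candidate contexts with a BERT-style bi-encoder looks like; sentence-transformers and the MiniLM checkpoint are stand-ins, not the paper's trained retriever.

```python
# Hedged sketch: score candidate contexts for a query sentence by cosine
# similarity of bi-encoder embeddings. Model and sentences are stand-ins.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "She greeted Mr. Darcy coldly at the door."
candidates = [
    "Mr. Darcy is a wealthy and proud gentleman.",
    "The weather that spring was unusually wet.",
]
scores = util.cos_sim(model.encode(query), model.encode(candidates))[0]
print(candidates[int(scores.argmax())])  # the Darcy sentence should win
```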
arXiv Detail & Related papers (2023-10-16T06:53:12Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
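The anchor-text idea can be sketched in a few lines: surface forms that link to the same Wikipedia page are treated as coreferent mentions. The wiki markup below is a toy example, not the paper's extraction pipeline.

```python
# Hedged sketch: mine coreferent mention pairs from Wikipedia anchor texts.
# Anchors pointing at the same target page are assumed coreferent.
import re
from collections import defaultdict
from itertools import combinations

wikitext = ("[[Barack Obama|Obama]] met [[Angela Merkel]] in Berlin. "
            "[[Barack Obama|The president]] later gave a speech.")

mentions = defaultdict(list)
for target, _, surface in re.findall(r"\[\[([^\]|]+)(\|([^\]]+))?\]\]",
                                     wikitext):
    mentions[target].append(surface or target)

pairs = [p for forms in mentions.values() for p in combinations(forms, 2)]
print(pairs)  # [('Obama', 'The president')]
```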
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - HiNER: A Large Hindi Named Entity Recognition Dataset [29.300418937509317]
This paper releases a standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
The statistics of tag-set in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location and Organisation.
Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper.
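For reference, entity-level weighted F1 of the kind reported above can be computed with seqeval; the toy predictions and the tag-collapsing map are illustrative, not HiNER's actual 11-tag scheme.

```python
# Hedged sketch: entity-level weighted F1 with seqeval, plus a hypothetical
# collapse of a fine-grained tag-set onto coarse PER/LOC/ORG/MISC classes.
from seqeval.metrics import f1_score

y_true = [["B-PERSON", "I-PERSON", "O", "B-LOCATION", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "O", "O"]]

COARSE = {"PERSON": "PER", "LOCATION": "LOC", "ORGANIZATION": "ORG"}

def collapse(tags):
    return [t if t == "O" else t[:2] + COARSE.get(t[2:], "MISC")
            for t in tags]

print(f1_score(y_true, y_pred, average="weighted"))
print(f1_score([collapse(s) for s in y_true],
               [collapse(s) for s in y_pred], average="weighted"))
```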
arXiv Detail & Related papers (2022-04-28T19:14:21Z) - An Open-Source Dataset and A Multi-Task Model for Malay Named Entity
Recognition [3.511753382329252]
We build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens).
An auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways.
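One common way to realize the explicit variant of such an auxiliary task is a second classification head on a shared encoder; the sketch below illustrates the pattern, with the BiLSTM encoder and all sizes being assumptions rather than the paper's architecture.

```python
# Hedged sketch: shared encoder with an NER head plus an auxiliary
# boundary-detection head.
import torch
import torch.nn as nn

class MultiTaskNER(nn.Module):
    def __init__(self, vocab=30000, hidden=256, n_tags=9):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.encoder = nn.LSTM(hidden, hidden // 2, batch_first=True,
                               bidirectional=True)
        self.ner_head = nn.Linear(hidden, n_tags)   # full entity tags
        self.boundary_head = nn.Linear(hidden, 3)   # B / I / O only

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))
        return self.ner_head(states), self.boundary_head(states)

model = MultiTaskNER()
ner_logits, boundary_logits = model(torch.randint(0, 30000, (2, 12)))
# Training sums a CrossEntropyLoss over both heads, typically weighting
# the auxiliary boundary loss with a small coefficient.
```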
arXiv Detail & Related papers (2021-09-03T03:29:25Z) - AdvPicker: Effectively Leveraging Unlabeled Data via Adversarial
Discriminator for Cross-Lingual NER [2.739898536581301]
We design an adversarial learning framework in which an encoder learns entity domain knowledge from labeled source-language data.
We show that the proposed method benefits strongly from this data selection process and outperforms existing state-of-the-art methods.
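The standard building block for this kind of adversarial encoder-discriminator setup is a gradient reversal layer; the sketch below shows the generic mechanism, not AdvPicker's exact formulation, and the feature dimension and discriminator are placeholders.

```python
# Hedged sketch: gradient reversal layer (GRL) as used in adversarial
# language discriminators; identity forward, negated gradient backward.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # no gradient for lam

features = torch.randn(4, 768, requires_grad=True)  # encoder outputs
discriminator = torch.nn.Linear(768, 2)             # source vs. target lang.
logits = discriminator(GradReverse.apply(features, 1.0))
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([0, 0, 1, 1]))
loss.backward()  # the encoder now receives a reversed gradient, pushing it
                 # toward language-invariant features
```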
arXiv Detail & Related papers (2021-06-04T07:17:18Z) - iNLTK: Natural Language Toolkit for Indic Languages [0.0]
We present iNLTK, an open-source NLP library consisting of pre-trained language models and out-of-the-box support for data augmentation, textual similarity, sentence embeddings, word embeddings, tokenization and text generation in 13 Indic languages.
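For orientation, iNLTK's advertised interface looks roughly like the following (per its public documentation; "hi" is Hindi, one of the 13 supported languages, and the example sentence is arbitrary).

```python
# Hedged usage sketch of iNLTK's documented interface.
from inltk.inltk import setup, tokenize, get_similar_sentences

setup("hi")  # one-time download of the pretrained Hindi model
print(tokenize("मुझे किताबें पढ़ना पसंद है", "hi"))
# Data augmentation: generate paraphrase-like variants of a sentence.
print(get_similar_sentences("मुझे किताबें पढ़ना पसंद है", 3, "hi"))
```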
arXiv Detail & Related papers (2020-09-26T08:21:32Z) - Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the scarcity of annotated NER data in low-resource languages.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z) - Coreferential Reasoning Learning for Language Representation [88.14248323659267]
We present CorefBERT, a novel language representation model that can capture the coreferential relations in context.
The experimental results show that, compared with existing baseline models, CorefBERT can achieve significant improvements consistently on various downstream NLP tasks.
arXiv Detail & Related papers (2020-04-15T03:57:45Z)