AsNER -- Annotated Dataset and Baseline for Assamese Named Entity
recognition
- URL: http://arxiv.org/abs/2207.03422v1
- Date: Thu, 7 Jul 2022 16:45:55 GMT
- Title: AsNER -- Annotated Dataset and Baseline for Assamese Named Entity
recognition
- Authors: Dhrubajyoti Pathak, Sukumar Nandi, Priyankoo Sarmah
- Abstract summary: The proposed NER dataset is likely to be a significant resource for deep neural network-based Assamese language processing.
We benchmark the dataset by training NER models and evaluating them using state-of-the-art architectures for supervised named entity recognition.
The highest F1-score among all baselines, 80.69%, is achieved when MuRIL is used as the word embedding method.
- Score: 7.252817150901275
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present AsNER, a named entity annotation dataset for the low-resource
Assamese language, together with a baseline Assamese NER model. The dataset
contains about 99k tokens comprising text from speeches of the Prime Minister of
India and an Assamese play, annotated with person names, location names and
addresses. The proposed NER dataset is likely to be a significant resource for
deep neural network-based Assamese language processing. We benchmark the dataset
by training NER models and evaluating them with state-of-the-art approaches to
supervised named entity recognition (NER) built on embeddings such as FastText,
BERT, XLM-R, FLAIR, and MuRIL. We implement several baseline approaches with the
state-of-the-art Bi-LSTM-CRF sequence tagging architecture. The highest F1-score
among all baselines, 80.69%, is achieved when MuRIL is used as the word embedding
method. The annotated dataset and the top-performing model are made publicly
available.
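Below is a minimal sketch of how such a baseline could be assembled: MuRIL contextual embeddings feeding a Bi-LSTM-CRF tagger. It assumes the Hugging Face checkpoint google/muril-base-cased and the pytorch-crf package; the tag count, hidden size, and other hyperparameters are illustrative guesses, not the authors' released configuration.

```python
# Sketch: Bi-LSTM-CRF sequence tagger over MuRIL embeddings (assumed setup).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from torchcrf import CRF  # pip install pytorch-crf


class BiLstmCrfTagger(nn.Module):
    def __init__(self, encoder_name="google/muril-base-cased",
                 num_tags=7, hidden=256):
        super().__init__()
        # MuRIL supplies contextual (sub)word embeddings.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        embeddings = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(embeddings)
        emissions = self.proj(lstm_out)
        mask = attention_mask.bool()
        if tags is not None:
            # CRF returns the log-likelihood; negate it to get a training loss.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # At inference time, Viterbi-decode the best tag sequence per sentence.
        return self.crf.decode(emissions, mask=mask)


tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = BiLstmCrfTagger()
batch = tokenizer(["<Assamese sentence here>"], return_tensors="pt", padding=True)
with torch.no_grad():
    pred_tag_ids = model(batch["input_ids"], batch["attention_mask"])
print(pred_tag_ids)  # one list of predicted tag indices per sentence
```

Whether the MuRIL encoder is frozen or fine-tuned, and how subword-level tags are aligned back to word-level labels, are choices this sketch deliberately leaves open.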
Related papers
- "I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities [59.22329574700317]
Spoken named entity recognition (NER) aims to identify named entities from speech.
New named entities appear every day; however, annotating Spoken NER data for them is costly.
We propose a method for generating Spoken NER data based on a named entity dictionary (NED) to reduce costs.
arXiv Detail & Related papers (2024-12-26T07:43:18Z) - A Multi-way Parallel Named Entity Annotated Corpus for English, Tamil and Sinhala [0.8675380166590487]
This paper presents a parallel English-Tamil-Sinhala corpus annotated with Named Entities (NEs).
Using pre-trained multilingual Language Models (mLMs), we establish new benchmark Named Entity Recognition (NER) results on this dataset for Sinhala and Tamil.
arXiv Detail & Related papers (2024-12-03T00:28:31Z) - XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z) - Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z) - CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual
Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z) - HiNER: A Large Hindi Named Entity Recognition Dataset [29.300418937509317]
This paper releases a standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
The statistics of the tag-set in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location and Organisation.
Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper (entity-level scores of this kind can be computed as in the sketch after this list).
arXiv Detail & Related papers (2022-04-28T19:14:21Z) - Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named
Entity Recognition [0.7874708385247353]
We consider NER for low-resource Indian languages like Hindi and Marathi.
We consider different variations of BERT like base-BERT, RoBERTa, and ALBERT and benchmark them on publicly available Hindi and Marathi NER datasets.
We show that the monolingual MahaRoBERTa model performs the best for Marathi NER whereas the multilingual XLM-RoBERTa performs the best for Hindi NER.
arXiv Detail & Related papers (2022-03-24T07:50:41Z) - An Open-Source Dataset and A Multi-Task Model for Malay Named Entity
Recognition [3.511753382329252]
We build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens).
An auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways.
arXiv Detail & Related papers (2021-09-03T03:29:25Z) - MobIE: A German Dataset for Named Entity Recognition, Entity Linking and
Relation Extraction in the Mobility Domain [76.21775236904185]
The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities.
A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types.
To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE.
arXiv Detail & Related papers (2021-08-16T08:21:50Z) - Zero-Resource Cross-Domain Named Entity Recognition [68.83177074227598]
Existing models for cross-domain named entity recognition rely on large unlabeled corpora or labeled NER training data in target domains.
We propose a cross-domain NER model that does not use any external resources.
arXiv Detail & Related papers (2020-02-14T09:04:18Z)
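The F1 figures quoted above (80.69% for AsNER, 88.78/92.22 for HiNER) are entity-level scores computed over BIO-tagged spans. One common way to compute such scores is the seqeval library; the tag sequences below are invented toy data purely for illustration.

```python
# Sketch: entity-level precision/recall/F1 from BIO tag sequences with seqeval.
from seqeval.metrics import classification_report, f1_score

# Toy gold and predicted label sequences (illustrative only).
y_true = [["B-PER", "I-PER", "O", "B-LOC"],
          ["O", "B-LOC", "O", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"],
          ["O", "B-LOC", "O", "B-PER"]]

# Micro-averaged entity-level F1: a span counts only if boundary and type match.
print("micro F1:", f1_score(y_true, y_pred))

# Per-class breakdown plus weighted averages, comparable to the per-tag
# distributions reported by the datasets listed above.
print(classification_report(y_true, y_pred))
```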
This list is automatically generated from the titles and abstracts of the papers in this site.