AsNER -- Annotated Dataset and Baseline for Assamese Named Entity
Recognition
- URL: http://arxiv.org/abs/2207.03422v1
- Date: Thu, 7 Jul 2022 16:45:55 GMT
- Title: AsNER -- Annotated Dataset and Baseline for Assamese Named Entity
Recognition
- Authors: Dhrubajyoti Pathak, Sukumar Nandi, Priyankoo Sarmah
- Abstract summary: The proposed NER dataset is likely to be a significant resource for deep neural network-based Assamese language processing.
We benchmark the dataset by training NER models and evaluating them using state-of-the-art architectures for supervised named entity recognition.
The best baseline reaches an F1-score of 80.69% when MuRIL is used as the word embedding method.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present AsNER, a named entity annotation dataset for the
low-resource Assamese language, together with a baseline Assamese NER model.
The dataset contains about 99k tokens comprising text from the speech of the
Prime Minister of India and an Assamese play, and includes person names,
location names, and addresses. The proposed NER dataset is likely to be a
significant resource for deep neural network-based Assamese language
processing. We benchmark the dataset by training NER models and evaluating
them using state-of-the-art word embedding methods for supervised named
entity recognition (NER), such as FastText, BERT, XLM-R, FLAIR, and MuRIL.
We implement several baseline approaches with the state-of-the-art
Bi-LSTM-CRF sequence-tagging architecture. The best baseline reaches an
F1-score of 80.69%, obtained when MuRIL is used as the word embedding method.
The annotated dataset and the top-performing model are made publicly
available.
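The abstract reports entity-level F1 as the benchmark metric. As a minimal sketch (not the authors' evaluation code, and assuming the standard BIO tagging scheme used by most NER datasets), the following stdlib-only Python computes micro-averaged entity-level F1 from gold and predicted tag sequences:

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a BIO tag sequence."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the final span
        # Close the open span on "O", on a new "B-", or on a type change.
        if start is not None and (tag == "O" or tag.startswith("B-")
                                  or tag[2:] != etype):
            entities.append((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]  # tolerate I- without B- (noisy data)
    return set(entities)

def entity_f1(gold, pred):
    """Micro-averaged entity-level F1 over a corpus of sentences."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gs, ps = extract_entities(g), extract_entities(p)
        tp += len(gs & ps)
        fp += len(ps - gs)
        fn += len(gs - ps)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

An exact span match on both boundaries and type counts as a true positive; this is the strict scoring convention popularized by the CoNLL shared tasks.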
Related papers
- Named Entity Recognition via Machine Reading Comprehension: A Multi-Task
Learning Approach [50.12455129619845]
Named Entity Recognition (NER) aims to extract and classify entity mentions in the text into pre-defined types.
We propose to incorporate the label dependencies among entity types into a multi-task learning framework for better MRC-based NER.
arXiv Detail & Related papers (2023-09-20T03:15:05Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented
Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z)
- Naamapadam: A Large-Scale Named Entity Annotated Data for Indic
Languages [15.214673043019399]
The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories.
The training dataset has been automatically created from the Samanantar parallel corpus.
We release IndicNER, a multilingual IndicBERT model fine-tuned on Naamapadam training set.
arXiv Detail & Related papers (2022-12-20T11:15:24Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual
Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
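The projection step described above can be illustrated with a toy sketch. This is not the CROP model (which uses a labeled-sequence translation model rather than external alignments); it only shows the underlying idea of carrying BIO tags from a tagged source sentence to target tokens via hypothetical word alignments:

```python
def project_tags(src_tags, alignment, tgt_len):
    """Project BIO tags across languages via word-alignment pairs.

    alignment: list of (src_idx, tgt_idx) pairs; unaligned target
    tokens default to "O". Toy illustration only.
    """
    tgt_tags = ["O"] * tgt_len
    for s, t in alignment:
        tgt_tags[t] = src_tags[s]
    # Repair BIO consistency: an I- tag with no preceding tag of the
    # same type must open a new span, so rewrite it as B-.
    for i, tag in enumerate(tgt_tags):
        if tag.startswith("I-"):
            prev = tgt_tags[i - 1] if i > 0 else "O"
            if prev == "O" or prev[2:] != tag[2:]:
                tgt_tags[i] = "B-" + tag[2:]
    return tgt_tags
```

The repair pass matters in practice: reordering between languages routinely breaks span contiguity, so projected tags must be renormalized before they can train a target-language tagger.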
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity
Recognition [15.805414696789796]
We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages.
This dataset is designed to represent contemporary challenges in NER, including low-context scenarios.
arXiv Detail & Related papers (2022-08-30T20:45:54Z)
- HiNER: A Large Hindi Named Entity Recognition Dataset [29.300418937509317]
This paper releases a standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
The statistics of tag-set in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location and Organisation.
Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper.
arXiv Detail & Related papers (2022-04-28T19:14:21Z)
- Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named
Entity Recognition [0.7874708385247353]
We consider NER for low-resource Indian languages like Hindi and Marathi.
We consider different variants of BERT, such as base-BERT, RoBERTa, and ALBERT, and benchmark them on publicly available Hindi and Marathi NER datasets.
We show that the monolingual MahaRoBERTa model performs the best for Marathi NER whereas the multilingual XLM-RoBERTa performs the best for Hindi NER.
arXiv Detail & Related papers (2022-03-24T07:50:41Z)
- An Open-Source Dataset and A Multi-Task Model for Malay Named Entity
Recognition [3.511753382329252]
We build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens).
An auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways.
arXiv Detail & Related papers (2021-09-03T03:29:25Z)
- MobIE: A German Dataset for Named Entity Recognition, Entity Linking and
Relation Extraction in the Mobility Domain [76.21775236904185]
The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities.
A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types.
To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE.
arXiv Detail & Related papers (2021-08-16T08:21:50Z)
- Zero-Resource Cross-Domain Named Entity Recognition [68.83177074227598]
Existing models for cross-domain named entity recognition rely on large unlabeled corpora or labeled NER training data in target domains.
We propose a cross-domain NER model that does not use any external resources.
arXiv Detail & Related papers (2020-02-14T09:04:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.