HiNER: A Large Hindi Named Entity Recognition Dataset
- URL: http://arxiv.org/abs/2204.13743v1
- Date: Thu, 28 Apr 2022 19:14:21 GMT
- Title: HiNER: A Large Hindi Named Entity Recognition Dataset
- Authors: Rudra Murthy, Pallab Bhattacharjee, Rahul Sharnagat, Jyotsana Khatri,
Diptesh Kanojia, Pushpak Bhattacharyya
- Abstract summary: This paper releases a standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
The tag-set statistics in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location, and Organisation.
Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Named Entity Recognition (NER) is a foundational NLP task that aims to
provide class labels like Person, Location, Organisation, Time, and Number to
words in free text. Named entities can also be multi-word expressions, where
additional I-O-B (Inside-Outside-Beginning) annotation helps label them during
the NER annotation process. While English and European languages have
considerable annotated data for the NER task, Indian languages lag behind on
that front, both in terms of quantity and adherence to annotation standards.
This paper releases a sizeable, standard-abiding Hindi NER dataset containing 109,146
sentences and 2,220,856 tokens, annotated with 11 tags. We discuss the dataset
statistics in all their essential detail and provide an in-depth analysis of
the NER tag-set used with our data. The tag-set statistics in our dataset
show a healthy per-tag distribution, especially for prominent classes like
Person, Location, and Organisation. Since the proof of a resource's
effectiveness lies in building models with it and testing those models on
benchmark data and against leaderboard entries in shared tasks, we do the
same with this data. We use different language models to perform the sequence
labelling task for NER and show the efficacy of our data by performing a
comparative evaluation with models trained on another dataset available for the
Hindi NER task. Our dataset helps achieve a weighted F1 score of 88.78 with all
the tags and 92.22 when we collapse the tag-set, as discussed in the paper. To
the best of our knowledge, no available dataset meets the standards of volume
(amount) and variability (diversity), as far as Hindi NER is concerned. We fill
this gap through this work, which we hope will significantly help NLP for
Hindi. We release this dataset with our code and models at
https://github.com/cfiltnlp/HiNER
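
The I-O-B scheme mentioned in the abstract marks the first token of an entity with a B- prefix, subsequent tokens of the same entity with I-, and non-entity tokens with O, which is how multi-word entities are recovered from per-token labels. As a minimal illustration (the example sentence, tag names, and helper function below are hypothetical and not drawn from HiNER itself), entity spans can be reconstructed from BIO tags like this:

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from parallel token/tag lists
    labelled with the BIO (I-O-B) scheme."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current[1].append(token)
        else:
            # O (or a stray I-) closes any open entity.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = ["Sachin", "Tendulkar", "was", "born", "in", "Mumbai", "."]
tags   = ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOCATION", "O"]
print(bio_to_spans(tokens, tags))
# → [('PERSON', 'Sachin Tendulkar'), ('LOCATION', 'Mumbai')]
```

Because the B-/I- distinction is what separates one entity from an adjacent one of the same type, collapsing the tag-set (as the paper does for its 92.22 F1 figure) changes only the type portion after the hyphen, not this span-recovery logic.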
Related papers
- Enhancing Low Resource NER Using Assisting Language and Transfer Learning (2023-06-10)
  We use base BERT, ALBERT, and RoBERTa to train a supervised NER model, and show that models trained on multiple languages perform better than models trained on a single language.
- DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition (2023-05-05)
  The MultiCoNER 2 shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios. Previous top systems in MultiCoNER 1 incorporate either knowledge bases or gazetteers. We propose a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER.
- GPT-NER: Named Entity Recognition via Large Language Models (2023-04-20)
  GPT-NER transforms the sequence-labelling task into a generation task that language models can easily adapt to. We find that GPT-NER exhibits a greater ability in low-resource and few-shot setups, where the amount of training data is extremely scarce. This demonstrates the capabilities of GPT-NER in real-world NER applications where the number of labelled examples is limited.
- Disambiguation of Company Names via Deep Recurrent Networks (2023-03-07)
  We propose a Siamese LSTM network approach to extract, via supervised learning, an embedding of company name strings. We analyse how an Active Learning approach to prioritising the samples to be labelled leads to a more efficient overall learning pipeline.
- Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages (2022-12-20)
  The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories. The training dataset has been automatically created from the Samanantar parallel corpus. We release IndicNER, a multilingual IndicBERT model fine-tuned on the Naamapadam training set.
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation (2022-10-13)
  Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results. We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER, adopting a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
- AsNER -- Annotated Dataset and Baseline for Assamese Named Entity Recognition (2022-07-07)
  The proposed NER dataset is likely to be a significant resource for deep-neural-network-based Assamese language processing. We benchmark the dataset by training NER models and evaluating them with state-of-the-art architectures for supervised named entity recognition. The highest F1 score among all baselines is 80.69%, obtained when using MuRIL as the word-embedding method.
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization (2022-04-28)
  In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages. We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model. We conduct multilingual zero-shot summarization experiments on the MLSUM and WikiLingua datasets and achieve state-of-the-art results under both human and automatic evaluation.
- Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition (2022-03-24)
  We consider NER for low-resource Indian languages like Hindi and Marathi. We benchmark different variations of BERT, such as base BERT, RoBERTa, and ALBERT, on publicly available Hindi and Marathi NER datasets. We show that the monolingual MahaRoBERTa model performs best for Marathi NER, whereas the multilingual XLM-RoBERTa performs best for Hindi NER.
- CL-NERIL: A Cross-Lingual Model for NER in Indian Languages (2021-11-23)
  This paper proposes an end-to-end framework for NER for Indian languages, exploiting parallel corpora of English and Indian languages and an English NER dataset. We present manually annotated test sets for three Indian languages: Hindi, Bengali, and Gujarati.
- NEREL: A Russian Dataset with Nested Named Entities and Relations (2021-08-30)
  We present NEREL, a Russian dataset for named entity recognition and relation extraction. It contains 56K annotated named entities and 39K annotated relations.
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.