L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models
- URL: http://arxiv.org/abs/2204.06029v1
- Date: Tue, 12 Apr 2022 18:32:15 GMT
- Title: L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models
- Authors: Parth Patil, Aparna Ranade, Maithili Sabane, Onkar Litake, Raviraj Joshi
- Abstract summary: We focus on Marathi, an Indian language spoken prominently by the people of Maharashtra state.
We present L3Cube-MahaNER, the first major gold-standard named entity recognition dataset in Marathi.
We benchmark the dataset on CNN, LSTM, and Transformer-based models such as mBERT, XLM-RoBERTa, IndicBERT, and MahaBERT.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Named Entity Recognition (NER) is a fundamental NLP task with major
applications in conversational and search systems. It helps identify the key
entities in a sentence that downstream applications rely on. NER and similar
slot-filling systems for popular languages have been used heavily in commercial
applications. In this work, we focus on Marathi, an Indian language spoken
prominently by the people of Maharashtra state. Marathi is a low-resource
language and still lacks useful NER resources. We present L3Cube-MahaNER, the
first major gold-standard named entity recognition dataset in Marathi, and we
describe the manual annotation guidelines followed during the process. Finally,
we benchmark the dataset on CNN, LSTM, and Transformer-based models such as
mBERT, XLM-RoBERTa, IndicBERT, and MahaBERT. MahaBERT provides the best
performance among all the models. The data and models are available at
https://github.com/l3cube-pune/MarathiNLP.
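A minimal usage sketch with the Hugging Face transformers library follows; the Hub model ID "l3cube-pune/marathi-bert-v2" and the label count are assumptions, and the token-classification head below stays randomly initialized until fine-tuned on L3Cube-MahaNER:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "l3cube-pune/marathi-bert-v2"  # assumed Hub ID for MahaBERT
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Assuming 8 entity classes in IOB notation -> 2 * 8 + 1 = 17 labels;
# this head is untrained until fine-tuned on the MahaNER data
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=17)

ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("पुणे हे महाराष्ट्रातील एक मोठे शहर आहे."))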
Related papers
- L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models
The L3Cube-MahaSocialNER dataset is the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language.
The dataset comprises 18,000 manually labeled sentences covering eight entity classes.
Deep learning models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on the individual dataset with IOB and non-IOB notations.
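To make the two notations concrete: in the non-IOB ("flat") scheme every token of an entity carries the bare class label, while IOB adds B-/I- prefixes so entity boundaries are explicit. A small illustrative converter (flat tags cannot distinguish two adjacent entities of the same type, which is exactly what IOB fixes):

```python
def to_iob(tags):
    """Convert flat tags like ["PER", "PER", "O"] into IOB: ["B-PER", "I-PER", "O"]."""
    iob, prev = [], "O"
    for tag in tags:
        if tag == "O":
            iob.append("O")
        elif tag == prev:
            iob.append(f"I-{tag}")  # continuation of the running entity
        else:
            iob.append(f"B-{tag}")  # first token of a new entity
        prev = tag
    return iob

print(to_iob(["PER", "PER", "O", "LOC"]))  # ['B-PER', 'I-PER', 'O', 'B-LOC']
```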
arXiv Detail & Related papers (2023-12-30T08:30:24Z)
- Enhancing Low Resource NER Using Assisting Language And Transfer Learning
We use base-BERT, AlBERT, and RoBERTa to train supervised NER models.
We show that models trained on multiple languages perform better than those trained on a single language.
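A minimal sketch of the assisting-language idea, assuming Hindi as the assisting language and placeholder dataset identifiers (both corpora must share one label scheme for the concatenation to be valid):

```python
from datasets import load_dataset, concatenate_datasets

# Placeholder identifiers, not the paper's actual dataset names
marathi = load_dataset("path/to/marathi-ner", split="train")
hindi = load_dataset("path/to/hindi-ner", split="train")

# Fine-tuning then proceeds on the pooled data; the paper reports that such
# multilingual training beats training on the target language alone.
combined = concatenate_datasets([marathi, hindi]).shuffle(seed=42)
```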
arXiv Detail & Related papers (2023-06-10T16:31:04Z)
- IXA/Cogcomp at SemEval-2023 Task 2: Context-enriched Multilingual Named Entity Recognition using Knowledge Bases
We present a novel NER cascade approach comprising three steps.
We empirically demonstrate the significance of external knowledge bases in accurately classifying fine-grained and emerging entities.
Our system exhibits robust performance in the MultiCoNER2 shared task, even in the low-resource language setting.
arXiv Detail & Related papers (2023-04-20T20:30:34Z)
- L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library
Despite being the third most popular language in India, the Marathi language lacks useful NLP resources.
With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing.
We present datasets and transformer models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection.
arXiv Detail & Related papers (2022-05-29T17:51:00Z)
- Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition
We consider NER for low-resource Indian languages like Hindi and Marathi.
We consider different variations of BERT like base-BERT, RoBERTa, and AlBERT and benchmark them on publicly available Hindi and Marathi NER datasets.
We show that the monolingual MahaRoBERTa model performs the best for Marathi NER whereas the multilingual XLM-RoBERTa performs the best for Hindi NER.
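Comparisons like this are scored with entity-level F1; a toy example using the seqeval package (an assumption here, the summary does not name the scorer) shows how the metric behaves:

```python
from seqeval.metrics import f1_score, classification_report

# One gold and one predicted tag sequence in IOB notation
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]

# PER is recovered, LOC is missed: precision 1.0, recall 0.5, F1 ~0.67
print(f1_score(gold, pred))
print(classification_report(gold, pred))
```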
arXiv Detail & Related papers (2022-03-24T07:50:41Z)
- L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources
We present L3Cube-MahaCorpus, a Marathi monolingual dataset scraped from different internet sources.
We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens.
We show the effectiveness of these resources on downstream classification and NER tasks.
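A rough sketch of how such a corpus is typically consumed: continued masked-language-model pretraining of a multilingual checkpoint on raw Marathi text. The corpus file name and hyperparameters are illustrative, not the authors' recipe:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# "maha_corpus.txt" is a placeholder: one raw Marathi sentence per line
corpus = load_dataset("text", data_files={"train": "maha_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens and train the model to reconstruct them
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="mahabert-mlm",
                                         per_device_train_batch_size=32),
                  train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
```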
arXiv Detail & Related papers (2022-02-02T17:35:52Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition
Cross-lingual NER transfers knowledge from a rich-resource language to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
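The distillation core of such an approach can be sketched as a per-token KL divergence between teacher and student label distributions on unlabeled target-language text; the reinforcement-learned instance selection is omitted in this sketch:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Both tensors: (batch, seq_len, num_labels); T is the softmax temperature."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # batchmean KL, scaled by T^2 as is conventional in distillation
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

student = torch.randn(4, 16, 9)  # e.g. 9 IOB labels
teacher = torch.randn(4, 16, 9)
print(kd_loss(student, teacher))
```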
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Multilingual Autoregressive Entity Linking
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
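Assuming the publicly released facebook/mgenre-wiki checkpoint on the Hugging Face Hub, inference looks roughly like this; the [START]/[END] mention markers follow that model card, and the full system additionally constrains beam search with a prefix trie over valid entity names, which is omitted here:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/mgenre-wiki")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki")

# The mention to link is wrapped in [START] ... [END] markers
sentence = "[START] Einstein [END] was a German-born physicist."
inputs = tokenizer(sentence, return_tensors="pt")

# Candidate entity names are generated left-to-right, token-by-token
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=5)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```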
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation
Cross-lingual methods have had notable success in addressing the scarcity of annotated data for low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
- Soft Gazetteers for Low-Resource Named Entity Recognition
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
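One way to picture the idea: per-token match scores against English KB entries, one score per entity type, concatenated onto the token representations before the tagging layers. A simplified sketch, not the authors' exact featurization:

```python
import torch

ENTITY_TYPES = ["PER", "LOC", "ORG", "MISC"]  # illustrative type inventory

def add_soft_gazetteer_features(token_embeddings, kb_scores):
    """token_embeddings: (seq_len, hidden); kb_scores: (seq_len, num_types)
    with soft KB-match scores in [0, 1]. Returns augmented embeddings."""
    return torch.cat([token_embeddings, kb_scores], dim=-1)

emb = torch.randn(6, 768)                    # e.g. contextual embeddings
scores = torch.rand(6, len(ENTITY_TYPES))    # stand-in for KB link scores
print(add_soft_gazetteer_features(emb, scores).shape)  # torch.Size([6, 772])
```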
arXiv Detail & Related papers (2020-05-04T21:58:02Z)
- Incorporating BERT into Neural Machine Translation
We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence.
We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets.
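A condensed PyTorch sketch of the fusion mechanism: each encoder layer attends both to its own states and to BERT's representations of the source sentence, and the two attention outputs are averaged. Dimensions and layer details are illustrative:

```python
import torch
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    def __init__(self, d_model=512, d_bert=768, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.bert_proj = nn.Linear(d_bert, d_model)  # project BERT states to NMT width
        self.bert_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(),
                                 nn.Linear(2048, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, bert_states):
        b = self.bert_proj(bert_states)
        h_self, _ = self.self_attn(x, x, x)    # ordinary self-attention
        h_bert, _ = self.bert_attn(x, b, b)    # cross-attention into BERT
        x = self.norm1(x + 0.5 * (h_self + h_bert))  # average the two paths
        return self.norm2(x + self.ffn(x))

layer = BertFusedEncoderLayer()
out = layer(torch.randn(2, 10, 512), torch.randn(2, 12, 768))
print(out.shape)  # torch.Size([2, 10, 512])
```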
arXiv Detail & Related papers (2020-02-17T08:13:36Z)