TriNER: A Series of Named Entity Recognition Models For Hindi, Bengali & Marathi
- URL: http://arxiv.org/abs/2502.04245v1
- Date: Thu, 06 Feb 2025 17:37:36 GMT
- Title: TriNER: A Series of Named Entity Recognition Models For Hindi, Bengali & Marathi
- Authors: Mohammed Amaan Dhamaskar, Rasika Ransing
- Abstract summary: This paper details our work to build a multilingual NER model for the three most spoken languages in India - Hindi, Bengali & Marathi.
We train a custom transformer model and fine-tune a few pretrained models, achieving an F1 score of 92.11 across a total of 6 entity groups.
- Abstract: India's rich cultural and linguistic diversity poses various challenges in the domain of Natural Language Processing (NLP), particularly in Named Entity Recognition (NER). NER is an NLP task that aims to identify and classify tokens into different entity groups like Person, Location, Organization, Number, etc. This makes NER very useful for downstream tasks like context-aware anonymization. This paper details our work to build a multilingual NER model for the three most spoken languages in India - Hindi, Bengali & Marathi. We train a custom transformer model and fine-tune a few pretrained models, achieving an F1 score of 92.11 for a total of 6 entity groups. Through this paper, we aim to introduce a single model to perform NER and significantly reduce the inconsistencies in entity groups and tag names across the three languages.
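Below is a minimal, hypothetical sketch of the kind of multilingual fine-tuning pipeline the abstract describes: a pretrained multilingual transformer fine-tuned for token classification over BIO tags. The choice of xlm-roberta-base, the tag inventory, and the toy Hindi example are illustrative assumptions, not the authors' actual setup or data.

```python
# Hypothetical sketch only: fine-tuning a multilingual transformer for
# NER-style token classification. Model name, tag set, and the toy example
# are assumptions made for illustration, not the TriNER training pipeline.
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

# BIO tags for 6 entity groups (Person, Location, Organization, Number, ...);
# the exact inventory used by TriNER is not reproduced here.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG",
          "B-NUM", "I-NUM"]
label2id = {tag: i for i, tag in enumerate(labels)}
id2label = {i: tag for tag, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels),
    id2label=id2label, label2id=label2id)

# A single toy Hindi sentence standing in for the Hindi/Bengali/Marathi corpus.
raw = Dataset.from_dict({
    "tokens": [["रतन", "टाटा", "मुंबई", "में", "रहते", "थे", "।"]],
    "ner_tags": [["B-PER", "I-PER", "B-LOC", "O", "O", "O", "O"]],
})

def encode(batch):
    # Align word-level BIO tags with the model's sub-word tokens.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    aligned = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev_word, ids = None, []
        for word_id in enc.word_ids(batch_index=i):
            if word_id is None:
                ids.append(-100)              # special tokens: ignored in loss
            elif word_id != prev_word:
                ids.append(label2id[tags[word_id]])
            else:
                ids.append(-100)              # label only the first sub-token
            prev_word = word_id
        aligned.append(ids)
    enc["labels"] = aligned
    return enc

train_ds = raw.map(encode, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="triner-sketch", num_train_epochs=1,
                           per_device_train_batch_size=8, report_to=[]),
    train_dataset=train_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

In a complete pipeline, the entity-level F1 reported above would typically be computed over the predicted BIO tags with a sequence-labelling metric library such as seqeval.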
Related papers
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present human-annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z) - Multicultural Name Recognition For Previously Unseen Names [65.268245109828]
This paper attempts to improve recognition of person names, a diverse category that can grow any time someone is born or changes their name.
I look at names from 103 countries to compare how well the model performs on names from different cultures.
I find that a model with combined character and word input outperforms word-only models and may improve accuracy compared to classical NER models.
arXiv Detail & Related papers (2024-01-23T17:58:38Z) - NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval [49.827932299460514]
We argue that capabilities provided by large language models are not the end of NER research, but rather an exciting beginning.
We present three variants of the NER task, together with a dataset to support them.
We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types.
arXiv Detail & Related papers (2023-10-22T12:23:00Z) - Enhancing Low Resource NER Using Assisting Language And Transfer Learning [0.7340017786387767]
We use base BERT, ALBERT, and RoBERTa to train a supervised NER model.
We show that models trained on multiple languages perform better than models trained on a single language.
arXiv Detail & Related papers (2023-06-10T16:31:04Z) - IXA/Cogcomp at SemEval-2023 Task 2: Context-enriched Multilingual Named Entity Recognition using Knowledge Bases [53.054598423181844]
We present a novel NER cascade approach comprising three steps.
We empirically demonstrate the significance of external knowledge bases in accurately classifying fine-grained and emerging entities.
Our system exhibits robust performance in the MultiCoNER2 shared task, even in the low-resource language setting.
arXiv Detail & Related papers (2023-04-20T20:30:34Z) - BanglaCoNER: Towards Robust Bangla Complex Named Entity Recognition [0.0]
We present the winning solution of the Bangla Complex Named Entity Recognition Challenge.
The dataset consisted of 15,300 sentences for training and 800 sentences for validation, in the .conll format.
Our findings also demonstrate the efficacy of Deep Learning models such as BanglaBERT for NER in the Bangla language.
arXiv Detail & Related papers (2023-03-16T13:31:31Z) - CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z) - Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition [0.7874708385247353]
We consider NER for low-resource Indian languages like Hindi and Marathi.
We consider different variations of BERT like base BERT, RoBERTa, and ALBERT and benchmark them on publicly available Hindi and Marathi NER datasets.
We show that the monolingual MahaRoBERTa model performs the best for Marathi NER whereas the multilingual XLM-RoBERTa performs the best for Hindi NER.
arXiv Detail & Related papers (2022-03-24T07:50:41Z) - WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation [101.00109827301235]
We introduce a novel paradigm for dataset creation based on human and machine collaboration.
We use dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instruct GPT-3 to compose new examples with similar patterns.
The resulting dataset, WANLI, consists of 108,357 natural language inference (NLI) examples that present unique empirical strengths.
arXiv Detail & Related papers (2022-01-16T03:13:49Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - NaijaNER: Comprehensive Named Entity Recognition for 5 Nigerian Languages [6.742864446722399]
We present our findings on Named Entity Recognition for 5 Nigerian languages.
These languages are considered low-resourced, and very little openly available Natural Language Processing work has been done in most of them.
arXiv Detail & Related papers (2021-03-30T22:10:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.