Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named
Entity Recognition
- URL: http://arxiv.org/abs/2203.12907v1
- Date: Thu, 24 Mar 2022 07:50:41 GMT
- Title: Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named
Entity Recognition
- Authors: Onkar Litake, Maithili Sabane, Parth Patil, Aparna Ranade, Raviraj
Joshi
- Abstract summary: We consider NER for low-resource Indian languages like Hindi and Marathi.
We consider different variations of BERT like base-BERT, RoBERTa, and AlBERT and benchmark them on publicly available Hindi and Marathi NER datasets.
We show that the monolingual MahaRoBERTa model performs the best for Marathi NER whereas the multilingual XLM-RoBERTa performs the best for Hindi NER.
- Score: 0.7874708385247353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Named entity recognition (NER) is the process of recognising and classifying
important information (entities) in text. Proper nouns, such as a person's
name, an organization's name, or a location's name, are examples of entities.
NER is an important module in applications such as human resources,
customer support, search engines, content classification, and academia. In this
work, we consider NER for low-resource Indian languages such as Hindi and Marathi.
Transformer-based models have been widely used for NER tasks. We consider
different variations of BERT like base-BERT, RoBERTa, and AlBERT and benchmark
them on publicly available Hindi and Marathi NER datasets. We provide an
exhaustive comparison of different monolingual and multilingual
transformer-based models and establish simple baselines currently missing in
the literature. We show that the monolingual MahaRoBERTa model performs the
best for Marathi NER whereas the multilingual XLM-RoBERTa performs the best for
Hindi NER. We also perform cross-language evaluation and present mixed
observations.
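The abstract does not say how the models are scored. As a minimal sketch, assuming BIO-style tags and exact-match entity-level F1 (a common convention for NER benchmarks, not confirmed by this summary), a comparison between gold and predicted tag sequences could look like the following. The function names and the PER/LOC labels are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        # A stray "I-" with no open span is treated leniently as a span start.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over all sentences."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical one-sentence example: the prediction misses the location.
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "O"]]
    print(entity_f1(gold, pred))
```

Under this convention a predicted entity counts as correct only when both its boundaries and its type match the gold annotation, so partially overlapping predictions score nothing.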
Related papers
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present a human annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z)
- Named Entity Recognition via Machine Reading Comprehension: A Multi-Task Learning Approach [50.12455129619845]
Named Entity Recognition (NER) aims to extract and classify entity mentions in the text into pre-defined types.
We propose to incorporate the label dependencies among entity types into a multi-task learning framework for better MRC-based NER.
arXiv Detail & Related papers (2023-09-20T03:15:05Z)
- Enhancing Low Resource NER Using Assisting Language And Transfer Learning [0.7340017786387767]
We use baseBERT, AlBERT, and RoBERTa to train a supervised NER model.
We show that models trained using multiple languages perform better than a single language.
arXiv Detail & Related papers (2023-06-10T16:31:04Z)
- IXA/Cogcomp at SemEval-2023 Task 2: Context-enriched Multilingual Named Entity Recognition using Knowledge Bases [53.054598423181844]
We present a novel NER cascade approach comprising three steps.
We empirically demonstrate the significance of external knowledge bases in accurately classifying fine-grained and emerging entities.
Our system exhibits robust performance in the MultiCoNER2 shared task, even in the low-resource language setting.
arXiv Detail & Related papers (2023-04-20T20:30:34Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi [0.966840768820136]
We focus on the Marathi language and evaluate the models on the datasets for hate speech detection, sentiment analysis and simple text classification in Marathi.
We use standard multilingual models such as mBERT, indicBERT and xlm-RoBERTa and compare with MahaBERT, MahaALBERT and MahaRoBERTa, the monolingual models for Marathi.
We show that monolingual MahaBERT based models provide rich representations as compared to sentence embeddings from multi-lingual counterparts.
arXiv Detail & Related papers (2022-04-19T05:07:58Z)
- L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models [0.7874708385247353]
We focus on Marathi, an Indian language, spoken prominently by the people of Maharashtra state.
We present L3Cube-MahaNER, the first major gold standard named entity recognition dataset in Marathi.
In the end, we benchmark the dataset on different CNN, LSTM, and Transformer based models like mBERT, XLM-RoBERTa, IndicBERT, MahaBERT, etc.
arXiv Detail & Related papers (2022-04-12T18:32:15Z)
- CL-NERIL: A Cross-Lingual Model for NER in Indian Languages [0.5926203312586108]
This paper proposes an end-to-end framework for NER for Indian languages.
We exploit parallel corpora of English and Indian languages and an English NER dataset.
We present manually annotated test sets for three Indian languages: Hindi, Bengali, and Gujarati.
arXiv Detail & Related papers (2021-11-23T12:09:15Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource language to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
- Soft Gazetteers for Low-Resource Named Entity Recognition [78.00856159473393]
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
arXiv Detail & Related papers (2020-05-04T21:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.