Enhancing Low Resource NER Using Assisting Language And Transfer Learning
- URL: http://arxiv.org/abs/2306.06477v1
- Date: Sat, 10 Jun 2023 16:31:04 GMT
- Title: Enhancing Low Resource NER Using Assisting Language And Transfer Learning
- Authors: Maithili Sabane, Aparna Ranade, Onkar Litake, Parth Patil, Raviraj Joshi, Dipali Kadam
- Abstract summary: We use base-BERT, ALBERT, and RoBERTa to train a supervised NER model.
We show that models trained on multiple languages perform better than those trained on a single language.
- Score: 0.7340017786387767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Named Entity Recognition (NER) is a fundamental NLP task used to
locate key information in text; it is applied primarily in conversational and
search systems. In commercial applications, NER and comparable slot-filling
methods are widely deployed for popular languages, in areas such as human
resources, customer service, search engines, content classification, and
academia. In this paper, we focus on identifying named entities in closely
related low-resource Indian languages, Hindi and Marathi. We train supervised
NER models using several adaptations of BERT, namely base-BERT, ALBERT, and
RoBERTa. We also compare multilingual models with monolingual models and
establish a baseline. In this work, we show the assisting capabilities of
Hindi and Marathi for each other's NER task: models trained on multiple
languages perform better than models trained on a single language. However,
we also observe that blindly mixing all datasets does not necessarily yield
improvements, and data selection methods may be required.
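To make the training setup concrete, the following is a minimal sketch of supervised NER fine-tuning with blind dataset mixing, using the HuggingFace transformers and datasets libraries. The Naamapadam dataset name, the label set, and the hyperparameters are illustrative assumptions, not the paper's published configuration.

```python
"""Hedged sketch: fine-tune a BERT-family encoder for NER on a mixed
Hindi + Marathi corpus. Dataset, labels, and hyperparameters are assumed."""
from datasets import load_dataset, concatenate_datasets
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

MODEL = "xlm-roberta-base"  # any base-BERT / ALBERT / RoBERTa variant works here
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # assumed tag set

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=len(LABELS))

def tokenize_and_align(batch):
    # Tokenize pre-split words and copy each word's tag to its sub-tokens;
    # special tokens get -100 so the loss ignores them.
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        enc["labels"].append([-100 if w is None else tags[w] for w in word_ids])
    return enc

# "Blind mixing" baseline: simply concatenate the two training sets. The
# abstract notes this does not always help, so data selection may be needed.
hi = load_dataset("ai4bharat/naamapadam", "hi", split="train")  # assumed dataset
mr = load_dataset("ai4bharat/naamapadam", "mr", split="train")
train = concatenate_datasets([hi, mr]).map(tokenize_and_align, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments("ner-hi-mr", per_device_train_batch_size=16,
                           num_train_epochs=3, learning_rate=2e-5),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

Swapping MODEL for a monolingual checkpoint and dropping one of the two `load_dataset` calls reproduces the monolingual baseline the abstract compares against.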
Related papers
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges of multilingual named entity recognition for Indian languages and propose techniques tailored to them.
We present a human-annotated named entity corpus of 40K sentences covering four Indian languages from two of the major Indian language families.
Our model achieves comparable performance on completely unseen benchmark datasets for Indian languages, which affirms its usability.
(arXiv, 2024-05-08)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
(arXiv, 2023-07-16)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
(arXiv, 2023-05-19)
- Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition [0.7874708385247353]
We consider NER for low-resource Indian languages like Hindi and Marathi.
We benchmark different variations of BERT, namely base-BERT, RoBERTa, and ALBERT, on publicly available Hindi and Marathi NER datasets.
We show that the monolingual MahaRoBERTa model performs the best for Marathi NER whereas the multilingual XLM-RoBERTa performs the best for Hindi NER.
(arXiv, 2022-03-24)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
(arXiv, 2021-06-01)
- NaijaNER: Comprehensive Named Entity Recognition for 5 Nigerian Languages [6.742864446722399]
We present our findings on Named Entity Recognition for 5 Nigerian languages.
These languages are considered low-resourced, and very little openly available Natural Language Processing work has been done in most of them.
(arXiv, 2021-03-30)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
(arXiv, 2020-12-31)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations, i.e., annotations produced by people who do not speak the target language.
We show that NS annotators produce results that are consistently on par with or better than cross-lingual methods built on modern contextual representations.
(arXiv, 2020-06-17)
- Soft Gazetteers for Low-Resource Named Entity Recognition [78.00856159473393]
We propose "soft gazetteers", which incorporate ubiquitously available information from English knowledge bases into neural named entity recognition models (see the sketch after this list).
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
(arXiv, 2020-05-04)
- Single-/Multi-Source Cross-Lingual NER via Teacher-Student Learning on Unlabeled Data in Target Language [28.8970132244542]
Cross-lingual NER must leverage knowledge learned from source languages with rich labeled data.
We propose a teacher-student learning method that exploits unlabeled data in the target language (see the distillation sketch after this list).
Our method outperforms existing state-of-the-art methods for both single-source and multi-source cross-lingual NER.
(arXiv, 2020-04-26)
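The soft-gazetteer idea above is easy to see in code: instead of a binary gazetteer-match flag, each token receives a vector of continuous match scores, one per entity type, derived from knowledge-base lookups, and that vector is concatenated to the token representation before tagging. Below is a hedged PyTorch sketch; the toy knowledge base, scoring rule, and BiLSTM tagger are stand-ins for the paper's KB-linking pipeline, not its actual implementation.

```python
"""Hedged sketch of soft gazetteer features: continuous per-type KB match
scores concatenated to token embeddings before a BiLSTM tagger."""
import torch
import torch.nn as nn

ENTITY_TYPES = ["PER", "LOC", "ORG"]
# Toy knowledge base: surface form -> plausibility score per entity type.
KB = {"pune": {"LOC": 0.9, "ORG": 0.1}, "maharashtra": {"LOC": 1.0}}

def soft_gazetteer_features(tokens):
    # One continuous feature per entity type instead of a hard 0/1 flag.
    feats = torch.zeros(len(tokens), len(ENTITY_TYPES))
    for i, tok in enumerate(tokens):
        for j, t in enumerate(ENTITY_TYPES):
            feats[i, j] = KB.get(tok.lower(), {}).get(t, 0.0)
    return feats

class GazetteerNER(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=64, num_tags=7):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # BiLSTM input = word embedding + soft gazetteer feature vector.
        self.lstm = nn.LSTM(emb_dim + len(ENTITY_TYPES), hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids, gaz_feats):
        x = torch.cat([self.emb(token_ids), gaz_feats], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)  # per-token tag logits

tokens = ["Pune", "is", "in", "Maharashtra"]
ids = torch.tensor([[1, 2, 3, 4]])  # stand-in vocabulary ids
feats = soft_gazetteer_features(tokens).unsqueeze(0)
logits = GazetteerNER(vocab_size=100)(ids, feats)
print(logits.shape)  # torch.Size([1, 4, 7])
```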
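The teacher-student entry above follows a standard distillation recipe: a teacher NER model trained on the labeled source language produces soft tag distributions over unlabeled target-language text, and a student is trained to match them. A minimal sketch, assuming simple stand-in models; the original method adds refinements such as multi-source teacher combination that are not shown here.

```python
"""Hedged sketch of teacher-student cross-lingual NER distillation on
unlabeled target-language data. Models are stand-ins, not the paper's."""
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, target_batch):
    # target_batch: token ids of *unlabeled* target-language sentences.
    with torch.no_grad():
        teacher_logits = teacher(target_batch)  # [B, T, num_tags]
    student_logits = student(target_batch)
    # The student matches the teacher's soft tag distribution; soft labels
    # carry more signal than hard argmax pseudo-labels.
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy stand-in taggers: embedding followed by a per-token tag projection.
teacher = nn.Sequential(nn.Embedding(1000, 32), nn.Linear(32, 7))
student = nn.Sequential(nn.Embedding(1000, 32), nn.Linear(32, 7))
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
batch = torch.randint(0, 1000, (8, 20))  # stand-in unlabeled target text
print(distillation_step(teacher, student, opt, batch))
```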