NaijaNER : Comprehensive Named Entity Recognition for 5 Nigerian
Languages
- URL: http://arxiv.org/abs/2105.00810v1
- Date: Tue, 30 Mar 2021 22:10:54 GMT
- Title: NaijaNER : Comprehensive Named Entity Recognition for 5 Nigerian
Languages
- Authors: Wuraola Fisayo Oyewusi, Olubayo Adekanmbi, Ifeoma Okoh, Vitus Onuigwe,
Mary Idera Salami, Opeyemi Osakuade, Sharon Ibejih, Usman Abdullahi Musa
- Abstract summary: We present our findings on Named Entity Recognition for 5 Nigerian languages.
These languages are considered low-resourced, and very little openly available Natural Language Processing work has been done in most of them.
- Score: 6.742864446722399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most common applications of Named Entity Recognition (NER) target
English and other high-resource languages. In this work, we present our
findings on NER for 5 Nigerian languages (Nigerian English, Nigerian Pidgin
English, Igbo, Yoruba and Hausa). These languages are considered
low-resourced, and very little openly available Natural Language Processing
work has been done on most of them. We trained individual NER models and
recorded metrics for each language, and also built a combined model that can
handle NER for any of the five languages. The combined model performs well on
each of the languages, with better performance than the individual NER models
trained on annotated data for a single language. The aim of this work is to
share our learning on how information extraction using NER can be optimized
for the listed Nigerian languages for inclusion, ease of deployment in
production and reusability of models. Models developed during this project
are available on GitHub (https://git.io/JY0kk) and as an interactive web app
(https://nigner.herokuapp.com/).
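The combined-model idea in the abstract, pooling annotated data from every language so one model serves all of them, can be illustrated with a minimal sketch. This is not the authors' code: the data, the CoNLL-style (token, tag) format, and the toy gazetteer "model" are all assumptions for illustration; the real work trains neural NER models.

```python
# Minimal sketch (not the authors' code): pool per-language annotated NER
# data into one combined training set, so a single model can serve any of
# the languages. A toy gazetteer stands in for the trained model.
from collections import defaultdict


def merge_datasets(per_language):
    """Pool (tokens, tags) sentence pairs from every language into one list."""
    combined = []
    for lang, sentences in per_language.items():
        for tokens, tags in sentences:
            combined.append({"lang": lang, "tokens": tokens, "tags": tags})
    return combined


def build_gazetteer(combined):
    """Toy 'model': remember every surface form seen with an entity tag."""
    gaz = defaultdict(set)
    for ex in combined:
        for tok, tag in zip(ex["tokens"], ex["tags"]):
            if tag != "O":
                gaz[tok.lower()].add(tag)
    return gaz


def tag(tokens, gaz):
    """Tag a sentence from any of the pooled languages."""
    return [sorted(gaz[t.lower()])[0] if t.lower() in gaz else "O"
            for t in tokens]


# Hypothetical per-language data in CoNLL-style (token, IOB tag) form.
data = {
    "yoruba": [(["Adé", "ń", "gbé", "Èkó"], ["B-PER", "O", "O", "B-LOC"])],
    "hausa":  [(["Musa", "ya", "tafi", "Kano"], ["B-PER", "O", "O", "B-LOC"])],
}
combined = merge_datasets(data)
gaz = build_gazetteer(combined)
print(tag(["Musa", "ń", "gbé", "Kano"], gaz))  # mixed-language input
```

A single pooled model like this is what makes deployment simpler: one artifact handles input in any of the five languages, rather than routing each request to a language-specific model.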
Related papers
- Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages [6.7638050195383075]
We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian languages.
We present a human annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families.
We achieve comparable performance on completely unseen benchmark datasets for Indian languages which affirms the usability of our model.
arXiv Detail & Related papers (2024-05-08T05:54:54Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Enhancing Low Resource NER Using Assisting Language And Transfer Learning [0.7340017786387767]
We use base BERT, ALBERT, and RoBERTa to train a supervised NER model.
We show that models trained using multiple languages perform better than models trained on a single language.
arXiv Detail & Related papers (2023-06-10T16:31:04Z)
- Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition [0.7874708385247353]
We consider NER for low-resource Indian languages like Hindi and Marathi.
We consider different variations of BERT like base BERT, RoBERTa, and ALBERT and benchmark them on publicly available Hindi and Marathi NER datasets.
We show that the monolingual MahaRoBERTa model performs the best for Marathi NER whereas the multilingual XLM-RoBERTa performs the best for Hindi NER.
arXiv Detail & Related papers (2022-03-24T07:50:41Z)
- CL-NERIL: A Cross-Lingual Model for NER in Indian Languages [0.5926203312586108]
This paper proposes an end-to-end framework for NER for Indian languages.
We exploit parallel corpora of English and Indian languages and an English NER dataset.
We present manually annotated test sets for three Indian languages: Hindi, Bengali, and Gujarati.
arXiv Detail & Related papers (2021-11-23T12:09:15Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
- Soft Gazetteers for Low-Resource Named Entity Recognition [78.00856159473393]
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
arXiv Detail & Related papers (2020-05-04T21:58:02Z)
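The soft-gazetteer idea in the last entry, replacing hard 0/1 gazetteer matches with continuous per-type scores from a knowledge base, can be sketched as follows. The lookup table, scores, and entity types here are invented for illustration and are not from the paper; in the actual method these features feed into a neural NER model alongside word representations.

```python
# Sketch of the "soft gazetteer" idea (data and names assumed, not the
# paper's code): for each token, emit one continuous feature per entity
# type from a knowledge-base lookup, instead of a hard 0/1 gazetteer match.
ENTITY_TYPES = ["PER", "LOC", "ORG"]

# Hypothetical knowledge-base lookup: surface form -> scored candidate types.
KB = {
    "lagos": {"LOC": 0.9, "ORG": 0.1},
    "buhari": {"PER": 1.0},
}


def soft_gazetteer_features(token):
    """Return one score per entity type for this token (0.0 if no KB match)."""
    scores = KB.get(token.lower(), {})
    return [scores.get(t, 0.0) for t in ENTITY_TYPES]


print(soft_gazetteer_features("Lagos"))  # [0.0, 0.9, 0.1]
print(soft_gazetteer_features("rain"))   # [0.0, 0.0, 0.0]
```

Because the scores come from an English knowledge base, this is one way resource-rich languages can assist NER in low-resource ones without any target-language gazetteer.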
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.