Development of a Dataset and a Deep Learning Baseline Named Entity
Recognizer for Three Low Resource Languages: Bhojpuri, Maithili and Magahi
- URL: http://arxiv.org/abs/2009.06451v1
- Date: Mon, 14 Sep 2020 14:07:50 GMT
- Authors: Rajesh Kumar Mundotiya, Shantanu Kumar, Ajeet Kumar, Umesh Chandra
Chaudhary, Supriya Chauhan, Swasti Mishra, Praveen Gatla, Anil Kumar Singh
- Abstract summary: Bhojpuri, Maithili and Magahi are low resource languages, usually known as Purvanchal languages.
This paper focuses on the development of a NER benchmark dataset for the Machine Translation systems developed to translate from these languages to Hindi, created by annotating parts of their available corpora.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Natural Language Processing (NLP) pipelines, Named Entity Recognition (NER) is one of the preliminary tasks: it marks proper nouns and other named entities such as Location, Person, Organization and Disease. Without a NER module, such entities adversely affect the performance of a machine translation system. NER overcomes this problem by recognising and handling such entities separately; it is also useful in Information Extraction systems. Bhojpuri, Maithili and Magahi are low resource languages, usually known as Purvanchal languages. This paper focuses on the development of a NER benchmark dataset for the Machine Translation systems developed to translate from these languages to Hindi, created by annotating parts of their available corpora. The Bhojpuri, Maithili and Magahi corpora, of 228,373, 157,468 and 56,190 tokens respectively, were annotated using 22 entity labels. The annotation uses coarse-grained labels, following the tagset of one of the Hindi NER datasets. We also report a Deep Learning-based baseline that uses an LSTM-CNNs-CRF model. A baseline NER tool using Conditional Random Fields (CRF) models obtains F1-scores of 96.73 for Bhojpuri, 93.33 for Maithili and 95.04 for Magahi. The Deep Learning-based technique (LSTM-CNNs-CRF) achieves 96.25 for Bhojpuri, 93.33 for Maithili and 95.44 for Magahi.
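To give a concrete picture of the baseline, here is a minimal sketch of an LSTM-CNNs-CRF tagger in the style of Ma and Hovy (2016): a character-level CNN produces per-word features, these are concatenated with word embeddings, encoded by a BiLSTM, and decoded by a CRF. It assumes PyTorch and the third-party pytorch-crf package; the class name and all hyperparameters are illustrative, not the configuration used in the paper.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf


class LstmCnnCrfTagger(nn.Module):
    """Illustrative LSTM-CNNs-CRF sequence tagger (not the paper's exact setup)."""

    def __init__(self, vocab_size, char_vocab_size, num_tags,
                 word_dim=100, char_dim=30, char_filters=30, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        # Character-level CNN: one convolution + max-pool per word.
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + char_filters, hidden // 2,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(hidden, num_tags)   # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def _char_features(self, chars):
        # chars: (batch, seq_len, max_word_len) character ids
        b, s, w = chars.shape
        x = self.char_emb(chars.view(b * s, w)).transpose(1, 2)
        x = torch.relu(self.char_cnn(x)).max(dim=2).values
        return x.view(b, s, -1)                   # (batch, seq_len, char_filters)

    def forward(self, words, chars, tags=None, mask=None):
        feats = torch.cat([self.word_emb(words), self._char_features(chars)], dim=-1)
        emissions = self.emit(self.lstm(feats)[0])
        if tags is not None:                      # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction='mean')
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths
```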
Related papers
- Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu (2025-02-17)
In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT.
This study systematically investigates how each resource and its quality affect translation performance, using Manchu as the test language.
Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help.
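To make the setup above concrete, this is a hypothetical sketch of how dictionary entries and parallel examples might be serialized into an in-context MT prompt for an LLM; the layout, field names, and example data are assumptions, not the study's actual protocol.

```python
def build_icl_prompt(source_sentence, dictionary, parallel_examples):
    """Assemble an in-context MT prompt (illustrative format, not the paper's).

    dictionary: {source_word: target_gloss}
    parallel_examples: [(source_sentence, target_sentence)]
    """
    lines = ["Translate from Manchu to English.", "", "Dictionary:"]
    for word, gloss in dictionary.items():
        lines.append(f"  {word} = {gloss}")
    lines += ["", "Examples:"]
    for src, tgt in parallel_examples:
        lines += [f"  Manchu: {src}", f"  English: {tgt}"]
    lines += ["", f"Manchu: {source_sentence}", "English:"]
    return "\n".join(lines)


# Invented example data, for illustration only.
prompt = build_icl_prompt(
    "bi bithe hūlambi",
    {"bi": "I", "bithe": "book", "hūlambi": "to read"},
    [("si aibide genembi", "where are you going")],
)
print(prompt)
```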
- A Multi-way Parallel Named Entity Annotated Corpus for English, Tamil and Sinhala (2024-12-03)
This paper presents a parallel English-Tamil-Sinhala corpus annotated with Named Entities (NEs).
Using pre-trained multilingual Language Models (mLMs), we establish new benchmark Named Entity Recognition (NER) results on this dataset for Sinhala and Tamil.
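Fine-tuning an mLM for NER requires aligning word-level entity labels with the model's subword tokens. Below is a self-contained sketch of that alignment step using Hugging Face transformers; the checkpoint, sentence, and labels are illustrative only.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # any mLM works
words = ["Kumar", "visited", "Colombo"]            # invented example
word_labels = ["B-PER", "O", "B-LOC"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
aligned = []
for word_id in enc.word_ids(batch_index=0):
    if word_id is None:
        aligned.append("IGN")                      # special tokens are ignored
    else:
        aligned.append(word_labels[word_id])       # each subword inherits its word's tag
print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()), aligned)))
```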
- Low-Resource Named Entity Recognition with Cross-Lingual, Character-Level Neural Conditional Random Fields (2024-04-14)
Low-resource named entity recognition is still an open problem in NLP.
We present a transfer learning scheme whereby we train character-level neural CRFs to predict named entities for high-resource and low-resource languages jointly.
- On Significance of Subword tokenization for Low Resource and Efficient Named Entity Recognition: A case study in Marathi (2023-12-03)
We focus on NER for low-resource languages and present our case study in the context of the Indian language Marathi.
We propose a hybrid approach for efficient NER by integrating a BERT-based subword tokenizer into vanilla CNN/LSTM models.
We show that this simple approach of replacing a traditional word-based tokenizer with a BERT-tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT.
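A minimal sketch of the hybrid idea: keep a vanilla BiLSTM tagger but build its input vocabulary from a BERT subword tokenizer rather than a word-level one. The checkpoint, layer sizes, and example sentence are illustrative; no pretrained weights are loaded into the tagger.

```python
import torch.nn as nn
from transformers import AutoTokenizer

# Subword tokenizer only; the tagger itself is trained from scratch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")


class SubwordLstmTagger(nn.Module):
    def __init__(self, num_tags, emb_dim=128, hidden=256):
        super().__init__()
        # Embedding table sized to the BERT subword vocabulary.
        self.emb = nn.Embedding(tokenizer.vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(hidden, num_tags)

    def forward(self, input_ids):
        h, _ = self.lstm(self.emb(input_ids))
        return self.out(h)                # per-subword tag logits


ids = tokenizer("मुंबई महाराष्ट्राची राजधानी आहे", return_tensors="pt")["input_ids"]
logits = SubwordLstmTagger(num_tags=7)(ids)   # (1, num_subwords, 7)
```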
- Developing a Named Entity Recognition Dataset for Tagalog (2023-11-13)
This dataset contains 7.8k documents across three entity types.
The inter-annotator agreement, as measured by Cohen's $\kappa$, is 0.81.
We released the data and processing code publicly to inspire future work on Tagalog NLP.
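For reference, Cohen's $\kappa$ corrects raw agreement for chance agreement: $\kappa = (p_o - p_e) / (1 - p_e)$. A toy computation with scikit-learn, using invented annotator tags:

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators' tags for the same six tokens (invented data).
annotator_a = ["PER", "O", "LOC", "O",   "ORG", "O"]
annotator_b = ["PER", "O", "LOC", "LOC", "ORG", "O"]
print(cohen_kappa_score(annotator_a, annotator_b))  # ~0.77, from 5/6 raw agreement
```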
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages (2023-09-19)
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
- MphayaNER: Named Entity Recognition for Tshivenda (2023-04-08)
This paper introduces MphayaNER, the first Tshivenda NER corpus in the news domain.
We establish NER baselines by fine-tuning state-of-the-art models on MphayaNER.
The study also explores zero-shot transfer between Tshivenda and other related Bantu languages, with chiShona and Kiswahili showing the best results.
- NusaCrowd: Open Source Initiative for Indonesian NLP Resources (2022-12-19)
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
- AsNER -- Annotated Dataset and Baseline for Assamese Named Entity Recognition (2022-07-07)
The proposed NER dataset is likely to be a significant resource for deep neural network based Assamese language processing.
We benchmark the dataset by training NER models and evaluating using state-of-the-art architectures for supervised named entity recognition.
The highest baseline F1-score, 80.69%, is obtained when using MuRIL as the word embedding method.
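As a sketch of the embedding setup, the snippet below extracts contextual word features from the public google/muril-base-cased checkpoint; the example sentence is illustrative, and the downstream NER tagger is omitted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")

sentence = "অসম ভাৰতৰ এখন ৰাজ্য"   # Assamese: "Assam is a state of India"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, num_subwords, 768)
# `hidden` would serve as the word-embedding input to an NER model.
```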
- L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models (2022-04-12)
We focus on Marathi, an Indian language, spoken prominently by the people of Maharashtra state.
We present L3Cube-MahaNER, the first major gold standard named entity recognition dataset in Marathi.
Finally, we benchmark the dataset on CNN, LSTM, and Transformer-based models such as mBERT, XLM-RoBERTa, IndicBERT, and MahaBERT.
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition (2021-06-01)
Cross-lingual NER transfers knowledge from rich-resource languages to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
- Building Low-Resource NER Models Using Non-Speaker Annotation (2020-06-17)
Cross-lingual methods have had notable success in addressing low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
- Soft Gazetteers for Low-Resource Named Entity Recognition (2020-05-04)
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
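A simplified sketch of the soft-gazetteer idea: each token carries a vector of knowledge-base match scores (one per entity type), which is projected and concatenated with its word embedding before the encoder. This featurization is a stand-in, not the paper's exact method.

```python
import torch
import torch.nn as nn


class SoftGazetteerEmbedder(nn.Module):
    def __init__(self, vocab_size, num_types, word_dim=100, gaz_dim=32):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # Projects per-token KB match scores (one per entity type,
        # e.g. PER/LOC/ORG) into a dense gazetteer feature.
        self.gaz_proj = nn.Linear(num_types, gaz_dim)

    def forward(self, token_ids, gaz_scores):
        # token_ids: (batch, seq_len); gaz_scores: (batch, seq_len, num_types)
        # holding soft (possibly fractional) scores from entity linking.
        return torch.cat([self.word_emb(token_ids),
                          self.gaz_proj(gaz_scores)], dim=-1)


emb = SoftGazetteerEmbedder(vocab_size=5000, num_types=3)
feats = emb(torch.randint(0, 5000, (2, 6)), torch.rand(2, 6, 3))
print(feats.shape)   # torch.Size([2, 6, 132])
```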