MphayaNER: Named Entity Recognition for Tshivenda
- URL: http://arxiv.org/abs/2304.03952v1
- Date: Sat, 8 Apr 2023 08:03:58 GMT
- Title: MphayaNER: Named Entity Recognition for Tshivenda
- Authors: Rendani Mbuvha, David I. Adelani, Tendani Mutavhatsindi, Tshimangadzo
Rakhuhu, Aluwani Mauda, Tshifhiwa Joshua Maumela, Andisani Masindi, Seani
Rananga, Vukosi Marivate, and Tshilidzi Marwala
- Abstract summary: This paper introduces MphayaNER, the first Tshivenda NER corpus in the news domain.
We establish NER baselines by fine-tuning state-of-the-art models on MphayaNER.
The study also explores zero-shot transfer between Tshivenda and other related Bantu languages, with chiShona and Kiswahili showing the best results.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Named Entity Recognition (NER) plays a vital role in various Natural Language
Processing tasks such as information retrieval, text classification, and
question answering. However, NER can be challenging, especially in low-resource
languages with limited annotated datasets and tools. This paper adds to the
effort of addressing these challenges by introducing MphayaNER, the first
Tshivenda NER corpus in the news domain. We establish NER baselines by
fine-tuning state-of-the-art models on MphayaNER. The study also
explores zero-shot transfer between Tshivenda and other related Bantu
languages, with chiShona and Kiswahili showing the best results. Augmenting
MphayaNER with chiShona data was also found to improve model performance
significantly. Both MphayaNER and the baseline models are made publicly
available.
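A core preprocessing step when fine-tuning subword-based models (BERT-style) for NER is aligning word-level BIO labels with subword tokens. The sketch below illustrates that alignment with a toy `word_ids` sequence; it is a hypothetical illustration of the general technique, not the authors' code (real pipelines such as Hugging Face Transformers expose `word_ids()` for the same purpose).

```python
# Align word-level BIO labels to subword tokens for NER fine-tuning.
# Only the first subword of each word keeps its label; continuation
# subwords and special tokens get ignore_index so the loss skips them.

def align_labels(word_labels, word_ids, ignore_index=-100):
    """Map word-level labels onto subword positions.

    word_ids: for each subword, the index of the word it came from,
              or None for special tokens ([CLS], [SEP]).
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)      # special token
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword of a word
        else:
            aligned.append(ignore_index)      # continuation subword
        previous = wid
    return aligned

# Toy example: the first word is split into two subwords.
labels = [1, 0]                     # word-level: B-LOC, O
word_ids = [None, 0, 0, 1, None]    # [CLS] sub sub word [SEP]
print(align_labels(labels, word_ids))  # [-100, 1, -100, 0, -100]
```

Masking continuation subwords with `-100` matches the default ignore index of cross-entropy loss in common deep-learning frameworks, so no extra loss configuration is needed.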
Related papers
- Long Range Named Entity Recognition for Marathi Documents
This paper offers a comprehensive analysis of current NER techniques designed for Marathi documents.
It dives into current practices and investigates the BERT transformer model's potential for long-range Marathi NER.
Acknowledging NER's critical role in NLP, the paper discusses the difficulties posed by Marathi's particular linguistic traits and contextual subtleties.
arXiv Detail & Related papers (2024-10-11T18:48:20Z)
- On Significance of Subword tokenization for Low Resource and Efficient Named Entity Recognition: A case study in Marathi
We focus on NER for low-resource language and present our case study in the context of the Indian language Marathi.
We propose a hybrid approach for efficient NER by integrating a BERT-based subword tokenizer into vanilla CNN/LSTM models.
We show that this simple approach of replacing a traditional word-based tokenizer with a BERT-tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT.
arXiv Detail & Related papers (2023-12-03T06:53:53Z)
- NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval
We argue that capabilities provided by large language models are not the end of NER research, but rather an exciting beginning.
We present three variants of the NER task, together with a dataset to support them.
We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types.
arXiv Detail & Related papers (2023-10-22T12:23:00Z)
- Named Entity Recognition via Machine Reading Comprehension: A Multi-Task Learning Approach
Named Entity Recognition (NER) aims to extract and classify entity mentions in the text into pre-defined types.
We propose to incorporate the label dependencies among entity types into a multi-task learning framework for better MRC-based NER.
arXiv Detail & Related papers (2023-09-20T03:15:05Z)
- MINER: Improving Out-of-Vocabulary Named Entity Recognition from an Information Theoretic Perspective
NER models have achieved promising performance on standard NER benchmarks.
Recent studies show that previous approaches may over-rely on entity mention information, resulting in poor performance on out-of-vocabulary (OOV) entity recognition.
We propose MINER, a novel NER learning framework, to remedy this issue from an information-theoretic perspective.
arXiv Detail & Related papers (2022-04-09T05:18:20Z)
- Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition
We consider NER for low-resource Indian languages like Hindi and Marathi.
We consider different variations of BERT like base-BERT, RoBERTa, and ALBERT and benchmark them on publicly available Hindi and Marathi NER datasets.
We show that the monolingual MahaRoBERTa model performs the best for Marathi NER whereas the multilingual XLM-RoBERTa performs the best for Hindi NER.
arXiv Detail & Related papers (2022-03-24T07:50:41Z)
- An Open-Source Dataset and A Multi-Task Model for Malay Named Entity Recognition
We build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens).
An auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways.
arXiv Detail & Related papers (2021-09-03T03:29:25Z)
- Development of a Dataset and a Deep Learning Baseline Named Entity Recognizer for Three Low Resource Languages: Bhojpuri, Maithili and Magahi
Bhojpuri, Maithili and Magahi are low resource languages, usually known as Purvanchal languages.
This paper develops a NER benchmark dataset by annotating parts of the available corpora for these languages, to support machine translation systems that translate from them into Hindi.
arXiv Detail & Related papers (2020-09-14T14:07:50Z)
- Soft Gazetteers for Low-Resource Named Entity Recognition
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
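The "soft gazetteer" idea replaces hard exact-match lookups with graded similarity scores between candidate spans and knowledge-base entries. The sketch below is a simplified illustration of that scoring step with a hypothetical entity list and string similarity; the authors' actual features are derived from English knowledge bases and richer matching.

```python
# Simplified "soft gazetteer" features: score each candidate span against
# entries of a (hypothetical) per-type entity list with fuzzy matching,
# instead of a binary exact-match lookup.
from difflib import SequenceMatcher

GAZETTEER = {
    "LOC": ["Limpopo", "Thohoyandou", "South Africa"],
    "PER": ["Tshilidzi Marwala"],
}

def soft_gazetteer_features(span_text):
    """Return per-type similarity scores in [0, 1] for a candidate span."""
    feats = {}
    for ent_type, entries in GAZETTEER.items():
        feats[ent_type] = max(
            SequenceMatcher(None, span_text.lower(), e.lower()).ratio()
            for e in entries
        )
    return feats

feats = soft_gazetteer_features("Limpopo")
print(feats["LOC"])  # 1.0 -- exact match with a LOC entry
```

The resulting feature vector can be concatenated to a span's neural representation, letting the tagger exploit KB evidence even when surface forms only partially match.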
arXiv Detail & Related papers (2020-05-04T21:58:02Z)
- Probing Linguistic Features of Sentence-Level Representations in Neural Relation Extraction
We introduce 14 probing tasks targeting linguistic properties relevant to neural relation extraction (RE).
We use them to study representations learned by more than 40 combinations of encoder architectures and linguistic features, trained on two datasets.
We find that the biases induced by the architecture and by the inclusion of linguistic features are clearly expressed in the probing task performance.
arXiv Detail & Related papers (2020-04-17T09:17:40Z)
- Neural Machine Translation: Challenges, Progress and Future
Machine translation (MT) is a technique that leverages computers to translate human languages automatically.
Neural machine translation (NMT) models the direct mapping between source and target languages with deep neural networks.
This article reviews the NMT framework, discusses the challenges in NMT, and introduces some exciting recent progress.
arXiv Detail & Related papers (2020-04-13T07:53:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.