On Significance of Subword tokenization for Low Resource and Efficient
Named Entity Recognition: A case study in Marathi
- URL: http://arxiv.org/abs/2312.01306v1
- Date: Sun, 3 Dec 2023 06:53:53 GMT
- Title: On Significance of Subword tokenization for Low Resource and Efficient
Named Entity Recognition: A case study in Marathi
- Authors: Harsh Chaudhari, Anuja Patil, Dhanashree Lavekar, Pranav Khairnar,
Raviraj Joshi, Sachin Pande
- Abstract summary: We focus on NER for a low-resource language and present our case study in the context of the Indian language Marathi.
We propose a hybrid approach for efficient NER by integrating a BERT-based subword tokenizer into vanilla CNN/LSTM models.
We show that this simple approach of replacing a traditional word-based tokenizer with a BERT-tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Named Entity Recognition (NER) systems play a vital role in NLP applications
such as machine translation, summarization, and question-answering. These
systems identify named entities, which encompass real-world concepts like
locations, persons, and organizations. Despite extensive research on NER
systems for the English language, they have not received adequate attention in
the context of low-resource languages. In this work, we focus on NER for a
low-resource language and present our case study in the context of the Indian
language Marathi. The advancement of NLP research revolves around the
utilization of pre-trained transformer models such as BERT for the development
of NER models. However, we focus on improving the performance of shallow models
based on CNNs and LSTMs by combining the best of both worlds. In the era of
transformers, these traditional deep learning models are still relevant because
of their high computational efficiency. We propose a hybrid approach for
efficient NER by integrating a BERT-based subword tokenizer into vanilla
CNN/LSTM models. We show that this simple approach of replacing a traditional
word-based tokenizer with a BERT-tokenizer brings the accuracy of vanilla
single-layer models closer to that of deep pre-trained models like BERT. We
show the importance of using subword tokenization for NER and present our
study toward building efficient NLP systems. The evaluation is performed on
the L3Cube-MahaNER dataset using tokenizers from MahaBERT, MahaGPT, IndicBERT,
and mBERT.
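As a concrete illustration of the hybrid approach, below is a minimal sketch that pairs a BERT subword tokenizer with a vanilla single-layer BiLSTM tagger. The checkpoint name l3cube-pune/marathi-bert-v2 (a MahaBERT release), the embedding and hidden sizes, and the tag count are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: a BERT subword tokenizer feeding a vanilla single-layer
# BiLSTM tagger. Checkpoint name, dimensions, and tag count are illustrative
# assumptions, not the authors' exact setup.
import torch
import torch.nn as nn
from transformers import AutoTokenizer

# Subword tokenizer from a pre-trained Marathi BERT (assumed MahaBERT checkpoint).
tokenizer = AutoTokenizer.from_pretrained("l3cube-pune/marathi-bert-v2")

class SubwordLSTMTagger(nn.Module):
    """Single-layer BiLSTM over subword embeddings; one tag logit per subword."""
    def __init__(self, vocab_size, num_tags, emb_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim,
                                padding_idx=tokenizer.pad_token_id)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, input_ids):
        h, _ = self.lstm(self.emb(input_ids))
        return self.out(h)  # (batch, seq_len, num_tags)

# The only change from a traditional pipeline: subword IDs replace word IDs.
enc = tokenizer("पुणे महाराष्ट्रातील एक शहर आहे", return_tensors="pt")
model = SubwordLSTMTagger(tokenizer.vocab_size, num_tags=7)  # e.g. a BIO scheme
logits = model(enc["input_ids"])
print(logits.shape)  # torch.Size([1, seq_len, 7])
```

The model itself stays shallow and cheap; only the input vocabulary changes, which is what lets the tokenizer carry pre-trained subword knowledge into an otherwise vanilla tagger.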
Related papers
- Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer (arXiv, 2023-09-14)
We propose C-FNT, a novel E2E model that incorporates class-based LMs into FNT.
In C-FNT, the LM score of named entities can be associated with the name class instead of its surface form.
The experimental results show that our proposed C-FNT significantly reduces error in named entities without hurting performance in general word recognition.
- Enhancing Low Resource NER Using Assisting Language And Transfer Learning (arXiv, 2023-06-10)
We use BERT-base, ALBERT, and RoBERTa to train a supervised NER model.
We show that models trained using multiple languages perform better than those trained on a single language.
- IXA/Cogcomp at SemEval-2023 Task 2: Context-enriched Multilingual Named Entity Recognition using Knowledge Bases (arXiv, 2023-04-20)
We present a novel NER cascade approach comprising three steps.
We empirically demonstrate the significance of external knowledge bases in accurately classifying fine-grained and emerging entities.
Our system exhibits robust performance in the MultiCoNER2 shared task, even in the low-resource language setting.
- German BERT Model for Legal Named Entity Recognition (arXiv, 2023-03-07)
We fine-tune a popular BERT language model trained on German data (German BERT) on a Legal Entity Recognition (LER) dataset.
Fine-tuning German BERT on the LER dataset outperforms the BiLSTM-CRF+ model used by the authors of the same dataset.
- MANER: Mask Augmented Named Entity Recognition for Extreme Low-Resource Languages (arXiv, 2022-12-19)
We introduce Mask Augmented Named Entity Recognition (MANER) for low-resource languages.
MANER re-purposes the <mask> token for NER prediction. Specifically, we prepend the <mask> token to every word in a sentence for which we would like to predict the named entity tag.
Experiments show that for 100 languages with as few as 100 training examples, it improves on state-of-the-art methods by up to 48% and by 12% on average on F1 score.
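A toy, non-authoritative illustration of that input construction: a <mask> token is prepended before each word whose tag we want to predict. The exact formatting MANER uses may differ; only the prepending idea is shown.

```python
# Toy illustration of MANER-style input construction (assumed formatting):
# prepend a <mask> token before each word whose entity tag we want to predict.
def maner_inputs(words, mask_token="<mask>"):
    tokens, mask_positions = [], []
    for word in words:
        mask_positions.append(len(tokens))  # index of this word's <mask>
        tokens.extend([mask_token, word])
    return tokens, mask_positions

tokens, positions = maner_inputs(["John", "lives", "in", "Pune"])
print(tokens)     # ['<mask>', 'John', '<mask>', 'lives', '<mask>', 'in', '<mask>', 'Pune']
print(positions)  # [0, 2, 4, 6] -> tag predicted from each <mask> representation
```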
- Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training (arXiv, 2021-09-10)
We study the problem of training named entity recognition (NER) models using only distantly-labeled data.
We propose a noise-robust learning scheme comprised of a new loss function and a noisy label removal step.
Our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.
- An Open-Source Dataset and A Multi-Task Model for Malay Named Entity Recognition (arXiv, 2021-09-03)
We build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens).
An auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways.
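A hedged sketch of what such a multi-task setup can look like: one shared encoder with an NER head and an auxiliary boundary-detection head, trained with a summed loss. The architecture, the boundary label set, and the 0.5 auxiliary weight are assumptions for illustration; the MYNER paper's configuration may differ.

```python
# Hedged multi-task sketch: shared encoder, NER head plus auxiliary
# boundary-detection head, joint loss. All dimensions, the boundary label set,
# and the 0.5 auxiliary weight are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=128, hidden=256, num_tags=7):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.ner_head = nn.Linear(2 * hidden, num_tags)  # main task
        self.boundary_head = nn.Linear(2 * hidden, 3)    # B/I/O entity boundaries

    def forward(self, input_ids):
        h, _ = self.encoder(self.emb(input_ids))
        return self.ner_head(h), self.boundary_head(h)

model = MultiTaskTagger()
ids = torch.randint(0, 30000, (2, 10))  # toy batch of token IDs
ner_logits, bnd_logits = model(ids)
loss_fn = nn.CrossEntropyLoss()
ner_gold = torch.randint(0, 7, (2, 10))
bnd_gold = torch.randint(0, 3, (2, 10))
# Joint objective: main NER loss plus down-weighted auxiliary boundary loss.
loss = loss_fn(ner_logits.flatten(0, 1), ner_gold.flatten()) \
     + 0.5 * loss_fn(bnd_logits.flatten(0, 1), bnd_gold.flatten())
loss.backward()
```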
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition (arXiv, 2021-06-01)
Cross-lingual NER transfers knowledge from rich-resource languages to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
- Building Low-Resource NER Models Using Non-Speaker Annotation (arXiv, 2020-06-17)
Cross-lingual methods have had notable success in addressing the scarcity of annotated data for low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
- Soft Gazetteers for Low-Resource Named Entity Recognition (arXiv, 2020-05-04)
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
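A hedged sketch of the soft-gazetteer idea: instead of binary gazetteer matches, each word gets continuous per-type scores derived from a knowledge base, which are concatenated to its representation before tagging. The scoring function below is a stand-in; the paper derives its scores from entity-linking candidates over English knowledge bases.

```python
# Hedged sketch of soft gazetteer features: continuous per-type KB scores
# concatenated to the word representation. The toy KB and normalized-count
# scoring are stand-ins for the paper's entity-linking-based scores.
import torch

ENTITY_TYPES = ["PER", "LOC", "ORG"]

def soft_gazetteer_scores(word, kb):
    """One continuous score per entity type, here normalized candidate counts."""
    counts = torch.tensor([float(kb.get((word, t), 0)) for t in ENTITY_TYPES])
    total = counts.sum()
    return counts / total if total > 0 else counts

toy_kb = {("pune", "LOC"): 12, ("pune", "ORG"): 3}
feat = soft_gazetteer_scores("pune", toy_kb)  # tensor([0.0, 0.8, 0.2])
word_repr = torch.randn(300)                  # e.g. a word embedding
augmented = torch.cat([word_repr, feat])      # input to the NER tagger
print(feat, augmented.shape)                  # torch.Size([303])
```

Concatenating soft scores, rather than hard matches, lets the tagger learn how much to trust noisy knowledge-base evidence.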