MANER: Mask Augmented Named Entity Recognition for Extreme Low-Resource
Languages
- URL: http://arxiv.org/abs/2212.09723v1
- Date: Mon, 19 Dec 2022 18:49:50 GMT
- Title: MANER: Mask Augmented Named Entity Recognition for Extreme Low-Resource
Languages
- Authors: Shashank Sonkar, Zichao Wang, Richard G. Baraniuk
- Abstract summary: We introduce Mask Augmented Named Entity Recognition (MANER) for low-resource languages.
MANER re-purposes the <mask> token for NER prediction. Specifically, we prepend the <mask> token to every word in a sentence for which we would like to predict the named entity tag.
Experiments show that for 100 languages with as few as 100 training examples, it improves on state-of-the-art methods by up to 48% and by 12% on average on F1 score.
- Score: 27.812329651072343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the problem of Named Entity Recognition (NER) for
extreme low-resource languages with only a few hundred tagged data samples. NER
is a fundamental task in Natural Language Processing (NLP). A critical driver
accelerating NER systems' progress is the existence of large-scale language
corpora that enable NER systems to achieve outstanding performance in languages
such as English and French with abundant training data. However, NER for
low-resource languages remains relatively unexplored. In this paper, we
introduce Mask Augmented Named Entity Recognition (MANER), a new methodology
that leverages the distributional hypothesis of pre-trained masked language
models (MLMs) for NER. The <mask> token in pre-trained MLMs encodes valuable
semantic contextual information. MANER re-purposes the <mask> token for NER
prediction. Specifically, we prepend the <mask> token to every word in a
sentence for which we would like to predict the named entity tag. During
training, we jointly fine-tune the MLM and a new NER prediction head attached
to each <mask> token. We demonstrate that MANER is well-suited for NER in
low-resource languages; our experiments show that for 100 languages with as few
as 100 training examples, it improves on state-of-the-art methods by up to 48%
and by 12% on average on F1 score. We also perform detailed analyses and
ablation studies to understand the scenarios that are best-suited to MANER.
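To make the procedure concrete, below is a minimal sketch of MANER-style mask augmentation in PyTorch with Hugging Face transformers. The encoder choice (xlm-roberta-base), the tag count, and the linear prediction head are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of MANER-style mask augmentation (encoder choice,
# tag set size, and head are assumptions, not the paper's exact setup).
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

NUM_TAGS = 9  # e.g., BIO tags over PER/LOC/ORG/MISC (assumption)
ner_head = nn.Linear(encoder.config.hidden_size, NUM_TAGS)

words = ["Marie", "Curie", "was", "born", "in", "Warsaw"]
# Prepend <mask> to every word whose entity tag we want to predict.
augmented = " ".join(f"{tokenizer.mask_token} {w}" for w in words)
inputs = tokenizer(augmented, return_tensors="pt")

hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
logits = ner_head(hidden[0, mask_pos])  # one row of logits per original word
pred_tags = logits.argmax(dim=-1)

# Training would jointly fine-tune `encoder` and `ner_head` with
# cross-entropy between `logits` and the gold tags at the mask positions.
```

The key point is that each word's tag is predicted from the hidden state of the <mask> token preceding it, so the MLM's pre-trained distribution over masked positions is reused directly for tagging.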
Related papers
- GEIC: Universal and Multilingual Named Entity Recognition with Large Language Models [7.714969840571947]
We introduce the task of generation-based extraction and in-context classification (GEIC).
We then propose CascadeNER, a universal and multilingual GEIC framework for few-shot and zero-shot NER.
We also introduce AnythingNER, the first NER dataset specifically designed for Large Language Models (LLMs).
arXiv Detail & Related papers (2024-09-17T09:32:12Z)
- On Significance of Subword tokenization for Low Resource and Efficient Named Entity Recognition: A case study in Marathi [1.6383036433216434]
We focus on NER for low-resource language and present our case study in the context of the Indian language Marathi.
We propose a hybrid approach for efficient NER by integrating a BERT-based subword tokenizer into vanilla CNN/LSTM models.
We show that this simple approach of replacing a traditional word-based tokenizer with a BERT-tokenizer brings the accuracy of vanilla single-layer models closer to that of deep pre-trained models like BERT.
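As a rough illustration of this hybrid approach (the tokenizer checkpoint, layer sizes, and tag count below are assumptions rather than the paper's configuration), a vanilla single-layer BiLSTM tagger can consume BERT subword ids in place of a word-level vocabulary:

```python
# Hedged sketch: BERT subword tokenizer feeding a plain BiLSTM tagger.
import torch.nn as nn
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

class SubwordLSTMTagger(nn.Module):
    """Single-layer BiLSTM over BERT subword ids; no pre-trained encoder."""
    def __init__(self, vocab_size, num_tags, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, input_ids):
        h, _ = self.lstm(self.embed(input_ids))
        return self.out(h)  # (batch, seq_len, num_tags)

ids = tokenizer("उदाहरण वाक्य", return_tensors="pt")["input_ids"]
model = SubwordLSTMTagger(tokenizer.vocab_size, num_tags=9)
logits = model(ids)
```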
arXiv Detail & Related papers (2023-12-03T06:53:53Z)
- Self-Evolution Learning for Discriminative Language Model Pretraining [103.57103957631067]
Self-Evolution learning (SE) is a simple and effective token masking and learning method.
SE focuses on learning the informative yet under-explored tokens and adaptively regularizes the training by introducing a novel Token-specific Label Smoothing approach.
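A generic sketch of token-specific label smoothing follows; how SE actually chooses the per-token smoothing weights is not described here, so `eps_per_token` is a placeholder input.

```python
# Hedged sketch of token-specific label smoothing; the weight schedule
# `eps_per_token` is a placeholder, not SE's actual computation.
import torch.nn.functional as F

def ts_label_smoothing_loss(logits, targets, eps_per_token):
    # logits: (N, vocab), targets: (N,), eps_per_token: (N,) in [0, 1]
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    uniform = -log_probs.mean(dim=-1)  # smoothing toward a uniform prior
    return ((1 - eps_per_token) * nll + eps_per_token * uniform).mean()
```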
arXiv Detail & Related papers (2023-05-24T16:00:54Z)
- IXA/Cogcomp at SemEval-2023 Task 2: Context-enriched Multilingual Named Entity Recognition using Knowledge Bases [53.054598423181844]
We present a novel NER cascade approach comprising three steps.
We empirically demonstrate the significance of external knowledge bases in accurately classifying fine-grained and emerging entities.
Our system exhibits robust performance in the MultiCoNER2 shared task, even in the low-resource language setting.
arXiv Detail & Related papers (2023-04-20T20:30:34Z)
- GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator [114.8954615026781]
We propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator.
GanLM is trained with two pre-training objectives: replaced token detection and replaced token denoising.
Experiments on language generation benchmarks show that GanLM, with its strong language understanding capability, outperforms various strong pre-trained language models.
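For intuition, here is a minimal ELECTRA-style replaced-token-detection loss; GanLM's actual generator/discriminator design and its replaced token denoising objective may differ.

```python
# Hedged sketch of a replaced-token-detection loss (ELECTRA-style).
import torch.nn.functional as F

def rtd_loss(disc_logits, corrupted_ids, original_ids):
    # disc_logits: (batch, seq, 1) discriminator scores per position.
    # Label 1 where the generator replaced a token, 0 where it is original.
    labels = (corrupted_ids != original_ids).float()
    return F.binary_cross_entropy_with_logits(disc_logits.squeeze(-1), labels)
```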
arXiv Detail & Related papers (2022-12-20T12:51:11Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- An Open-Source Dataset and A Multi-Task Model for Malay Named Entity Recognition [3.511753382329252]
We build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens).
An auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways.
arXiv Detail & Related papers (2021-09-03T03:29:25Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language [0.015863809575305417]
We introduce Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy to create a better representative vocabulary for training.
We compare our tokenization strategy to others on the Fon-French and French-Fon translation tasks.
arXiv Detail & Related papers (2021-03-14T22:12:14Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the challenges of low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that using NS annotators produces results consistently on par with or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)