MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy
Named Entity Recognition
- URL: http://arxiv.org/abs/2310.13213v1
- Date: Fri, 20 Oct 2023 01:14:46 GMT
- Title: MultiCoNER v2: a Large Multilingual dataset for Fine-grained and Noisy
Named Entity Recognition
- Authors: Besnik Fetahu, Zhiyu Chen, Sudipta Kar, Oleg Rokhlenko, Shervin
Malmasi
- Abstract summary: This dataset aims to tackle the following practical challenges in NER: (i) effective handling of fine-grained classes that include complex entities like movie titles, and (ii) performance degradation due to noise generated from typing mistakes or OCR errors.
The dataset is compiled from open resources like Wikipedia and Wikidata, and is publicly available.
- Score: 36.868805760086886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MULTICONER V2, a dataset for fine-grained Named Entity Recognition
covering 33 entity classes across 12 languages, in both monolingual and
multilingual settings. This dataset aims to tackle the following practical
challenges in NER: (i) effective handling of fine-grained classes that include
complex entities like movie titles, and (ii) performance degradation due to
noise generated from typing mistakes or OCR errors. The dataset is compiled
from open resources like Wikipedia and Wikidata, and is publicly available.
Evaluation based on the XLM-RoBERTa baseline highlights the unique challenges
posed by MULTICONER V2: (i) the fine-grained taxonomy is challenging, where the
scores are low with macro-F1=0.63 (across all languages), and (ii) the
corruption strategy significantly impairs performance, with entity corruption
resulting in 9% lower performance relative to non-entity corruptions across all
languages. This highlights the greater impact of entity noise in contrast to
context noise.
Related papers
- Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition [40.23783832224238]
We present B2NERD, a cohesive and efficient dataset for Open NER.
We detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types.
Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods in 3 out-of-domain benchmarks across 15 datasets and 6 languages.
arXiv Detail & Related papers (2024-06-17T03:57:35Z) - ACLM: A Selective-Denoising based Generative Data Augmentation Approach
for Low-Resource Complex NER [47.32935969127478]
We present ACLM Attention-map aware keyword selection for Conditional Language Model fine-tuning.
ACLM alleviates the context-entity mismatch issue, a problem existing NER data augmentation techniques suffer from.
We demonstrate the effectiveness of ACLM both qualitatively and quantitatively on monolingual, cross-lingual, and multilingual complex NER.
arXiv Detail & Related papers (2023-06-01T17:33:04Z) - Mitigating Data Imbalance and Representation Degeneration in
Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z) - DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System
for Multilingual Named Entity Recognition [94.90258603217008]
The MultiCoNER RNum2 shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios.
Previous top systems in the MultiCoNER RNum1 either incorporate the knowledge bases or gazetteers.
We propose a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER.
arXiv Detail & Related papers (2023-05-05T16:59:26Z) - IXA/Cogcomp at SemEval-2023 Task 2: Context-enriched Multilingual Named
Entity Recognition using Knowledge Bases [53.054598423181844]
We present a novel NER cascade approach comprising three steps.
We empirically demonstrate the significance of external knowledge bases in accurately classifying fine-grained and emerging entities.
Our system exhibits robust performance in the MultiCoNER2 shared task, even in the low-resource language setting.
arXiv Detail & Related papers (2023-04-20T20:30:34Z) - CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual
Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z) - MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity
Recognition [15.805414696789796]
We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages.
This dataset is designed to represent contemporary challenges in NER, including low-context scenarios.
arXiv Detail & Related papers (2022-08-30T20:45:54Z) - An Open-Source Dataset and A Multi-Task Model for Malay Named Entity
Recognition [3.511753382329252]
We build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens)
An auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways.
arXiv Detail & Related papers (2021-09-03T03:29:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.