MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity
Recognition
- URL: http://arxiv.org/abs/2208.14536v1
- Date: Tue, 30 Aug 2022 20:45:54 GMT
- Title: MultiCoNER: A Large-scale Multilingual dataset for Complex Named Entity
Recognition
- Authors: Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, Oleg
Rokhlenko
- Abstract summary: We present MultiCoNER, a large multilingual dataset for Named Entity Recognition that covers 3 domains (Wiki sentences, questions, and search queries) across 11 languages.
This dataset is designed to represent contemporary challenges in NER, including low-context scenarios.
- Score: 15.805414696789796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MultiCoNER, a large multilingual dataset for Named Entity
Recognition that covers 3 domains (Wiki sentences, questions, and search
queries) across 11 languages, as well as multilingual and code-mixing subsets.
This dataset is designed to represent contemporary challenges in NER, including
low-context scenarios (short and uncased text), syntactically complex entities
like movie titles, and long-tail entity distributions. The 26M token dataset is
compiled from public resources using techniques such as heuristic-based
sentence sampling, template extraction and slotting, and machine translation.
We applied two NER models on our dataset: a baseline XLM-RoBERTa model, and a
state-of-the-art GEMNET model that leverages gazetteers. The baseline achieves
moderate performance (macro-F1=54%), highlighting the difficulty of our data.
GEMNET, which uses gazetteers, improves significantly (average improvement
of macro-F1=+30%). MultiCoNER poses challenges even for large pre-trained
language models, and we believe that it can help further research in building
robust NER systems. MultiCoNER is publicly available at
https://registry.opendata.aws/multiconer/ and we hope that this resource will
help advance research in various aspects of NER.
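The abstract reports results as macro-F1, which averages the per-entity-type F1 scores so that rare types count as much as frequent ones. The sketch below illustrates that averaging only; it is not the authors' evaluation script. The entity type names and the counts are hypothetical, and the way spans are matched to produce true/false positives is assumed to happen elsewhere.

```python
from typing import Dict, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one entity type from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    # Return 0.0 when the type never occurs and is never predicted.
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-type F1 scores."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in counts.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical (tp, fp, fn) counts per entity type, for illustration only.
example_counts = {
    "PER": (8, 2, 2),
    "LOC": (5, 5, 5),
    "WORK": (1, 3, 9),
}

if __name__ == "__main__":
    print(f"macro-F1 = {macro_f1(example_counts):.4f}")
```

Because every type contributes equally to the mean, a poorly recognized long-tail type (here the hypothetical `WORK`) lowers the score noticeably even when frequent types are recognized well.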
Related papers
- 2M-NER: Contrastive Learning for Multilingual and Multimodal NER with Language and Modal Fusion [9.038363543966263]
We construct a large-scale MMNER dataset with four languages (English, French, German and Spanish) and two modalities (text and image).
We introduce a new model called 2M-NER, which aligns the text and image representations using contrastive learning and integrates a multimodal collaboration module.
Our model achieves the highest F1 score in multilingual and multimodal NER tasks compared with representative baselines.
arXiv Detail & Related papers (2024-04-26T02:34:31Z)
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models [97.40590590880144]
We develop an extensive Multimodality Large Language Model (MLLM) series.
We assemble a comprehensive dataset covering publicly available resources in language, vision, and vision-language tasks.
We obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities.
arXiv Detail & Related papers (2024-02-08T18:59:48Z)
- Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark [39.01204607174688]
We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages.
UNER v1 contains 18 datasets annotated with named entities in a cross-lingual consistent schema across 12 diverse languages.
arXiv Detail & Related papers (2023-11-15T17:09:54Z)
- Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval [56.65147231836708]
We develop SWIM-IR, a synthetic retrieval training dataset containing 33 languages for fine-tuning multilingual dense retrievers.
SAP assists the large language model (LLM) in generating informative queries in the target language.
Our models, called SWIM-X, are competitive with human-supervised dense retrieval models.
arXiv Detail & Related papers (2023-11-10T00:17:10Z)
- NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval [49.827932299460514]
We argue that capabilities provided by large language models are not the end of NER research, but rather an exciting beginning.
We present three variants of the NER task, together with a dataset to support them.
We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types.
arXiv Detail & Related papers (2023-10-22T12:23:00Z)
- Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition [94.90258603217008]
The MultiCoNER II shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios.
Previous top systems in MultiCoNER I incorporate either knowledge bases or gazetteers.
We propose a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER.
arXiv Detail & Related papers (2023-05-05T16:59:26Z)
- Large Scale Multi-Lingual Multi-Modal Summarization Dataset [26.92121230628835]
We present the current largest multi-lingual multi-modal summarization dataset (M3LS).
It consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair.
It is also the largest summarization dataset for 13 languages and consists of cross-lingual summarization data for 2 languages.
arXiv Detail & Related papers (2023-02-13T18:00:23Z)
- UM6P-CS at SemEval-2022 Task 11: Enhancing Multilingual and Code-Mixed Complex Named Entity Recognition via Pseudo Labels using Multilingual Transformer [7.270980742378389]
We introduce our submitted system to the Multilingual Complex Named Entity Recognition (MultiCoNER) shared task.
We approach complex NER for multilingual and code-mixed queries by relying on the contextualized representation provided by the multilingual Transformer XLM-RoBERTa.
Our proposed system is ranked 6th and 8th in the multilingual and code-mixed MultiCoNER tracks, respectively.
arXiv Detail & Related papers (2022-04-28T14:07:06Z)
- An Open-Source Dataset and A Multi-Task Model for Malay Named Entity Recognition [3.511753382329252]
We build a Malay NER dataset (MYNER) comprising 28,991 sentences (over 384 thousand tokens).
An auxiliary task, boundary detection, is introduced to improve NER training in both explicit and implicit ways.
arXiv Detail & Related papers (2021-09-03T03:29:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.