MSNER: A Multilingual Speech Dataset for Named Entity Recognition
- URL: http://arxiv.org/abs/2405.11519v1
- Date: Sun, 19 May 2024 11:17:00 GMT
- Title: MSNER: A Multilingual Speech Dataset for Named Entity Recognition
- Authors: Quentin Meeus, Marie-Francine Moens, Hugo Van hamme
- Abstract summary: We introduce MSNER, a freely available, multilingual speech corpus annotated with named entities.
It provides annotations to the VoxPopuli dataset in four languages.
This yields 590 hours of silver-annotated speech for training and 15 hours for validation, alongside a 17-hour, manually-annotated evaluation set.
- Score: 34.88608417778945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While extensively explored in text-based tasks, Named Entity Recognition (NER) remains largely neglected in spoken language understanding. Existing resources are limited to a single, English-only dataset. This paper addresses this gap by introducing MSNER, a freely available, multilingual speech corpus annotated with named entities. It provides annotations to the VoxPopuli dataset in four languages (Dutch, French, German, and Spanish). We also release an efficient annotation tool that leverages automatic pre-annotations for faster manual refinement. This results in 590 hours of silver-annotated speech for training and 15 hours for validation, alongside a 17-hour, manually-annotated evaluation set. We further provide an analysis comparing silver and gold annotations. Finally, we present baseline NER models to stimulate further research on this newly available dataset.
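As an illustration of how the silver and gold annotations could be compared, below is a minimal sketch of an entity-level evaluation with seqeval. The record layout, file name, and field names are assumptions made for illustration, not the official MSNER release format.
```python
# Minimal sketch: entity-level comparison of silver vs. gold BIO annotations.
# The record layout ({"tokens", "silver", "gold"} per utterance, stored as a
# JSON list) and the file name are hypothetical; adapt to the actual release.
import json

from seqeval.metrics import classification_report, f1_score


def load_records(path: str) -> list[dict]:
    """Load utterance records with token-aligned silver and gold BIO tags."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


records = load_records("msner_eval_nl.json")  # hypothetical file name
gold = [r["gold"] for r in records]           # e.g. ["B-PER", "I-PER", "O", ...]
silver = [r["silver"] for r in records]

# Treat the manual (gold) tags as reference and the automatic (silver) tags as
# predictions to quantify how closely the pre-annotations match the manual ones.
print("Entity-level F1 (silver vs. gold):", f1_score(gold, silver))
print(classification_report(gold, silver))
```
The same entity-level scoring would also apply to evaluating baseline NER models against the manually-annotated evaluation set.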
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper [96.43501666278316]
This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages.
We employ a Whisper model which can conduct both language identification and audio-based speech recognition.
By comparing VSR models trained on automatic labels with those trained on human-annotated labels, we show that the automatic labels achieve similar VSR performance.
arXiv Detail & Related papers (2023-09-15T16:53:01Z)
- WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning Experiments for Slovak Named Entity Recognition [0.0]
We introduce WikiGoldSK, the first sizable human-labelled Slovak NER dataset.
We benchmark it by evaluating state-of-the-art multilingual Pretrained Language Models.
We conduct few-shot experiments and show that training on a silver-standard dataset yields better results.
arXiv Detail & Related papers (2023-04-08T14:37:52Z)
- Multilingual Word Sense Disambiguation with Unified Sense Representation [55.3061179361177]
We propose building knowledge-based and supervised Multilingual Word Sense Disambiguation (MWSD) systems.
We build unified sense representations for multiple languages and address the annotation scarcity problem for MWSD by transferring annotations from resource-rich languages to poorer ones.
Evaluations on the SemEval-13 and SemEval-15 datasets demonstrate the effectiveness of our methodology.
arXiv Detail & Related papers (2022-10-14T01:24:03Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- Selective Annotation Makes Language Models Better Few-Shot Learners [97.07544941620367]
Large language models can perform in-context learning, where they learn a new task from a few task demonstrations.
This work examines the implications of in-context learning for the creation of datasets for new natural language tasks.
We propose an unsupervised, graph-based selective annotation method, vote-k, to select diverse, representative examples to annotate.
arXiv Detail & Related papers (2022-09-05T14:01:15Z)
- XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection [0.42056926734482064]
The dataset consists of human-annotated Finnish (25k) and English (30k) sentences.
We use Plutchik's core emotions to annotate the dataset with the addition of neutral to create a multilabel multiclass dataset.
The dataset is carefully evaluated using language-specific BERT models and SVMs to show that XED performs on par with other similar datasets.
arXiv Detail & Related papers (2020-11-03T10:43:22Z)
- UNER: Universal Named-Entity Recognition Framework [0.0]
We create the first multilingual UNER corpus: the SETimes parallel corpus annotated for named entities.
The English SETimes corpus will be annotated using existing tools and knowledge bases.
The resulting annotations will be propagated automatically to other languages within the SETimes corpora.
arXiv Detail & Related papers (2020-10-23T13:53:31Z)