Related papers: Building and Evaluating Universal Named-Entity Recognition English corpus

URL: http://arxiv.org/abs/2212.07162v1
Date: Wed, 14 Dec 2022 11:32:24 GMT
Title: Building and Evaluating Universal Named-Entity Recognition English corpus
Authors: Diego Alves, Gaurish Thakkar, Marko Tadić
Abstract summary: This article presents the application of the Universal Named Entity framework to generate automatically annotated corpora. By using a workflow that extracts Wikipedia data and meta-data and DBpedia information, we generated an English dataset which is described and evaluated.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: This article presents the application of the Universal Named Entity framework to generate automatically annotated corpora. By using a workflow that extracts Wikipedia data and meta-data and DBpedia information, we generated an English dataset which is described and evaluated. Furthermore, we conducted a set of experiments to improve the annotations in terms of precision, recall, and F1-measure. The final dataset is available and the established workflow can be applied to any language with existing Wikipedia and DBpedia. As part of future research, we intend to continue improving the annotation process and extend it to other languages.
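The abstract reports precision, recall, and F1-measure but this listing does not say how they were computed. As a minimal sketch of the standard definitions, assuming exact-match scoring over entity spans (the span representation, the function name, and the sample annotations are illustrative assumptions, not taken from the paper):

```python
from typing import Set, Tuple

# Assumed representation: (start_token, end_token, label). Not from the paper.
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over two sets of entity spans."""
    # A prediction counts as correct only if boundaries and label both match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: one prediction matches, one has the wrong label.
gold = {(0, 1, "PER"), (5, 6, "LOC"), (9, 9, "ORG")}
predicted = {(0, 1, "PER"), (5, 6, "ORG")}
p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Exact matching is the strictest common choice; partial-overlap or token-level scoring would give different numbers on the same annotations.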

Related papers

Evaluating D-MERIT of Partial-annotation on Information Retrieval [77.44452769932676]
Retrieval models are often evaluated on partially-annotated datasets. We show that using partially-annotated datasets in evaluation can paint a distorted picture.
arXiv Detail & Related papers (2024-06-23T08:24:08Z)
Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction. We reformulate the task to be entity-centric, enabling the use of diverse metrics. We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z)
FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection [0.6116681488656472]
This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection. We present the creation of French Annotated Resource with Semantic Information for Medical Detection (FRASIMED), an annotated corpus comprising 2'051 synthetic clinical cases in French.
arXiv Detail & Related papers (2023-09-19T17:17:28Z)
Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia [0.0]
We present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named-entities. We describe in detail the developed procedure necessary to create this type of dataset in any language available on Wikipedia with DBpedia information.
arXiv Detail & Related papers (2022-12-14T11:38:48Z)
Entity Cloze By Date: What LMs Know About Unseen Entities [79.34707800653597]
Language models (LMs) are typically trained once on a large-scale corpus and used for years without being updated. We propose a framework to analyze what LMs can infer about new entities that did not exist when the LMs were pretrained. We derive a dataset of entities indexed by their origination date and paired with their English Wikipedia articles, from which we can find sentences about each entity.
arXiv Detail & Related papers (2022-05-05T17:59:31Z)
Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages. We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model. We conduct multilingual zero-shot summarization experiments on the MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
CREER: A Large-Scale Corpus for Relation Extraction and Entity Recognition [9.54366784050374]
The CREER dataset uses the Stanford CoreNLP Annotator to capture rich language structures from Wikipedia plain text. This dataset follows widely used linguistic and semantic annotations so that it can be used not only for most natural language processing tasks but also for scaling the dataset.
arXiv Detail & Related papers (2022-04-27T05:43:21Z)
Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language. The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German. We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
WikiGUM: Exhaustive Entity Linking for Wikification in 12 Genres [6.619650459583443]
We present and evaluate WikiGUM, a fully wikified dataset covering all mentions of named entities. The dataset covers a broad range of 12 written and spoken genres, most of which have not been included in Entity Linking efforts to date.
arXiv Detail & Related papers (2021-09-15T17:35:24Z)
MobIE: A German Dataset for Named Entity Recognition, Entity Linking and Relation Extraction in the Mobility Domain [76.21775236904185]
The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities. A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types. To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE.
arXiv Detail & Related papers (2021-08-16T08:21:50Z)
UNER: Universal Named-Entity Recognition Framework [0.0]
We create the first multilingual UNER corpus: the SETimes parallel corpus annotated for named-entities. The English SETimes corpus will be annotated using existing tools and knowledge bases. The resulting annotations will be propagated automatically to other languages within the SE-Times corpora.
arXiv Detail & Related papers (2020-10-23T13:53:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.