Building and Evaluating Universal Named-Entity Recognition English
corpus
- URL: http://arxiv.org/abs/2212.07162v1
- Date: Wed, 14 Dec 2022 11:32:24 GMT
- Title: Building and Evaluating Universal Named-Entity Recognition English
corpus
- Authors: Diego Alves, Gaurish Thakkar, Marko Tadić
- Abstract summary: This article presents the application of the Universal Named Entity framework to generate automatically annotated corpora.
By using a workflow that extracts Wikipedia data and meta-data and DBpedia information, we generated an English dataset which is described and evaluated.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This article presents the application of the Universal Named Entity framework
to generate automatically annotated corpora. By using a workflow that extracts
Wikipedia data and meta-data and DBpedia information, we generated an English
dataset which is described and evaluated. Furthermore, we conducted a set of
experiments to improve the annotations in terms of precision, recall, and
F1-measure. The final dataset is available and the established workflow can be
applied to any language with existing Wikipedia and DBpedia. As part of future
research, we intend to continue improving the annotation process and extend it
to other languages.
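The abstract reports annotation quality in terms of precision, recall, and F1-measure. As a minimal sketch of how these three scores relate, the snippet below computes them for exact-match entity spans. The span representation `(start, end, label)`, the function name `prf1`, the labels `PER`/`LOC`/`ORG`, and the toy data are all assumptions made for illustration; the paper's own evaluation protocol (for example token-level versus span-level matching) may differ.

```python
def prf1(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans.

    A predicted span counts as correct only if an identical span
    (same boundaries and same label) appears in the gold annotations.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy annotations (not taken from the paper).
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
predicted = [(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG"), (12, 13, "LOC")]

precision, recall, f1 = prf1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case two of the four predicted spans match the gold spans exactly, so precision is 2/4, recall is 2/3, and F1 is 4/7.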
Related papers
- Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z)
- FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection [0.6116681488656472]
This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection.
We present the creation of French Annotated Resource with Semantic Information for Medical Detection (FRASIMED), an annotated corpus comprising 2'051 synthetic clinical cases in French.
arXiv Detail & Related papers (2023-09-19T17:17:28Z)
- Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia [0.0]
We present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named-entities.
We describe in detail the developed procedure necessary to create this type of dataset in any language available on Wikipedia with DBpedia information.
arXiv Detail & Related papers (2022-12-14T11:38:48Z)
- Entity Cloze By Date: What LMs Know About Unseen Entities [79.34707800653597]
Language models (LMs) are typically trained once on a large-scale corpus and used for years without being updated.
We propose a framework to analyze what LMs can infer about new entities that did not exist when the LMs were pretrained.
We derive a dataset of entities indexed by their origination date and paired with their English Wikipedia articles, from which we can find sentences about each entity.
arXiv Detail & Related papers (2022-05-05T17:59:31Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- CREER: A Large-Scale Corpus for Relation Extraction and Entity Recognition [9.54366784050374]
The CREER dataset uses the Stanford CoreNLP Annotator to capture rich language structures from Wikipedia plain text.
This dataset follows widely used linguistic and semantic annotations so that it can be used not only for most natural language processing tasks but also for scaling the dataset.
arXiv Detail & Related papers (2022-04-27T05:43:21Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- WikiGUM: Exhaustive Entity Linking for Wikification in 12 Genres [6.619650459583443]
We present and evaluate WikiGUM, a fully wikified dataset covering all mentions of named entities.
The dataset covers a broad range of 12 written and spoken genres, most of which have not been included in Entity Linking efforts to date.
arXiv Detail & Related papers (2021-09-15T17:35:24Z)
- MobIE: A German Dataset for Named Entity Recognition, Entity Linking and Relation Extraction in the Mobility Domain [76.21775236904185]
The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities.
A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types.
To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE.
arXiv Detail & Related papers (2021-08-16T08:21:50Z)
- UNER: Universal Named-Entity Recognition Framework [0.0]
We create the first multilingual UNER corpus: the SETimes parallel corpus annotated for named-entities.
The English SETimes corpus will be annotated using existing tools and knowledge bases.
The resulting annotations will be propagated automatically to other languages within the SE-Times corpora.
arXiv Detail & Related papers (2020-10-23T13:53:31Z)