Building Multilingual Corpora for a Complex Named Entity Recognition and
Classification Hierarchy using Wikipedia and DBpedia
- URL: http://arxiv.org/abs/2212.07429v1
- Date: Wed, 14 Dec 2022 11:38:48 GMT
- Title: Building Multilingual Corpora for a Complex Named Entity Recognition and
Classification Hierarchy using Wikipedia and DBpedia
- Authors: Diego Alves, Gaurish Thakkar, Gabriel Amaral, Tin Kuculo, Marko
Tadić
- Abstract summary: We present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named-entities.
We describe in detail the developed procedure necessary to create this type of dataset in any language available on Wikipedia with DBpedia information.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the ever-growing popularity of the field of NLP, the demand for datasets
in low-resource languages follows suit. Following a previously established
framework, in this paper, we present the UNER dataset, a multilingual and
hierarchical parallel corpus annotated for named-entities. We describe in
detail the developed procedure necessary to create this type of dataset in any
language available on Wikipedia with DBpedia information. The three-step
procedure extracts entities from Wikipedia articles, links them to DBpedia, and
maps the DBpedia sets of classes to the UNER labels. This is followed by a
post-processing procedure that significantly increases the number of identified
entities in the final results. The paper concludes with a statistical and
qualitative analysis of the resulting dataset.
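The three-step procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes wikilinks in `[[Target|anchor]]` form as the source of entity mentions, replaces the DBpedia lookup with a small in-memory table (`TOY_DBPEDIA_TYPES`, hypothetical), and uses a made-up class-to-label table (`CLASS_TO_LABEL`) whose labels are placeholders rather than actual UNER tags. The post-processing stage is not covered.

```python
import re

# Matches [[Target]] or [[Target|anchor text]] wikilinks.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

# Hypothetical stand-in for a DBpedia query: resource URI -> ontology classes.
TOY_DBPEDIA_TYPES = {
    "http://dbpedia.org/resource/Zagreb": {"dbo:Place", "dbo:City"},
    "http://dbpedia.org/resource/Nikola_Tesla": {"dbo:Person", "dbo:Scientist"},
}

# Hypothetical mapping, most specific class first.
# The labels are placeholders, not the real UNER hierarchy.
CLASS_TO_LABEL = [
    ("dbo:City", "LOC-CITY"),
    ("dbo:Place", "LOC"),
    ("dbo:Person", "PER"),
]


def extract_links(wikitext):
    """Step 1: return (anchor text, target title) pairs found in the wikitext."""
    return [(m.group(2) or m.group(1), m.group(1))
            for m in WIKILINK.finditer(wikitext)]


def to_dbpedia_uri(title):
    """Step 2: build the DBpedia resource URI for a Wikipedia page title."""
    return "http://dbpedia.org/resource/" + title.strip().replace(" ", "_")


def map_label(classes):
    """Step 3: pick the first label whose class is in the entity's class set."""
    for cls, label in CLASS_TO_LABEL:
        if cls in classes:
            return label
    return None


def annotate(wikitext):
    """Run the three steps; mentions without a known class are skipped."""
    annotations = []
    for anchor, title in extract_links(wikitext):
        uri = to_dbpedia_uri(title)
        label = map_label(TOY_DBPEDIA_TYPES.get(uri, set()))
        if label is not None:
            annotations.append((anchor, uri, label))
    return annotations


if __name__ == "__main__":
    text = "[[Nikola Tesla|Tesla]] studied in [[Graz]] and visited [[Zagreb]]."
    for row in annotate(text):
        print(row)
```

In this sketch the mention "Graz" is dropped because the toy table has no classes for it. Unlabeled mentions of this kind are presumably what the post-processing step mentioned in the abstract is meant to recover.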
Related papers
- SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding [55.48936731641802]
We present the SRFUND, a hierarchically structured multi-task form understanding benchmark.
SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets.
The dataset includes eight languages including English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese.
arXiv Detail & Related papers (2024-06-13T02:35:55Z)
- Sinhala-English Parallel Word Dictionary Dataset [0.554780083433538]
We introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to English and Sinhala languages.
arXiv Detail & Related papers (2023-08-04T10:21:35Z)
- Building and Evaluating Universal Named-Entity Recognition English corpus [0.0]
This article presents the application of the Universal Named Entity framework to generate automatically annotated corpora.
By using a workflow that extracts Wikipedia data and meta-data and DBpedia information, we generated an English dataset which is described and evaluated.
arXiv Detail & Related papers (2022-12-14T11:32:24Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- Improving Candidate Retrieval with Entity Profile Generation for Wikidata Entity Linking [76.00737707718795]
We propose a novel candidate retrieval paradigm based on entity profiling.
We use the profile to query the indexed search engine to retrieve candidate entities.
Our approach complements the traditional approach of using a Wikipedia anchor-text dictionary.
arXiv Detail & Related papers (2022-02-27T17:38:53Z)
- Named Entity Recognition and Linking Augmented with Large-Scale Structured Data [3.211619859724085]
We describe our submissions to the 2nd and 3rd SlavNER Shared Tasks held at BSNLP 2019 and BSNLP 2021.
The tasks focused on the analysis of Named Entities in multilingual Web documents in Slavic languages with rich inflection.
Our solution takes advantage of large collections of both unstructured and structured documents.
arXiv Detail & Related papers (2021-04-27T20:10:18Z)
- Generating Wikipedia Article Sections from Diverse Data Sources [57.23574577984244]
We benchmark several training and decoding strategies on WikiTableT.
Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they sometimes struggle with coherence.
arXiv Detail & Related papers (2020-12-29T19:35:34Z)
- The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual Relation Classification [0.0]
Current approaches for relation classification are mainly focused on the English language.
We propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup.
For evaluation, we introduce a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish.
arXiv Detail & Related papers (2020-10-19T11:08:16Z)
- A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
- ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.