Named Entity Recognition and Linking Augmented with Large-Scale
Structured Data
- URL: http://arxiv.org/abs/2104.13456v1
- Date: Tue, 27 Apr 2021 20:10:18 GMT
- Title: Named Entity Recognition and Linking Augmented with Large-Scale
Structured Data
- Authors: Paweł Rychlikowski, Bartłomiej Najdecki, Adrian Łańcucki, Adam Kaczmarek
- Abstract summary: We describe our submissions to the 2nd and 3rd SlavNER Shared Tasks held at BSNLP 2019 and BSNLP 2021.
The tasks focused on the analysis of Named Entities in multilingual Web documents in Slavic languages with rich inflection.
Our solution takes advantage of large collections of both unstructured and structured documents.
- Score: 3.211619859724085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we describe our submissions to the 2nd and 3rd SlavNER Shared
Tasks held at BSNLP 2019 and BSNLP 2021, respectively. The tasks focused on the
analysis of Named Entities in multilingual Web documents in Slavic languages
with rich inflection. Our solution takes advantage of large collections of both
unstructured and structured documents. The former serve as data for
unsupervised training of language models and embeddings of lexical units. The
latter refers to Wikipedia and its structured counterpart, Wikidata, which serves as our source of lemmatization rules and real-world entities. With the aid of those resources, our system could recognize, normalize and link entities while being trained with only small amounts of labeled data.
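The abstract describes normalizing inflected entity mentions with lemmatization rules and linking them to Wikidata entities. A minimal sketch of that idea, assuming a toy suffix-rule format and gazetteer (the rule syntax, data, and `normalize` function are illustrative, not the authors' actual system):

```python
from typing import Optional

# Hypothetical suffix rules for mapping inflected Slavic forms back to
# the base (nominative) form, of the kind derivable from Wikidata labels.
SUFFIX_RULES = {
    "ie": "a",  # e.g. locative "Warszawie" -> "Warszawa"
    "y": "a",   # e.g. genitive "Warszawy" -> "Warszawa"
}

# Toy gazetteer of known entity lemmas (in practice: Wikidata labels -> Q items).
KNOWN_ENTITIES = {"Warszawa": "Q270"}

def normalize(mention: str) -> Optional[str]:
    """Return a Wikidata ID for a mention, trying suffix rules in turn."""
    if mention in KNOWN_ENTITIES:
        return KNOWN_ENTITIES[mention]
    for suffix, base in SUFFIX_RULES.items():
        if mention.endswith(suffix):
            candidate = mention[: -len(suffix)] + base
            if candidate in KNOWN_ENTITIES:
                return KNOWN_ENTITIES[candidate]
    return None  # unresolved mention

print(normalize("Warszawie"))  # Q270
print(normalize("Warszawy"))   # Q270
```

A real system would learn such rules from aligned Wikidata labels and aliases across inflected forms rather than hand-writing them, but the lookup-plus-rule structure is the same.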
Related papers
- Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia [14.221520251569173]
We develop a framework for entity insertion called LocEI and its multilingual variant XLocEI.
We show that XLocEI outperforms all baseline models and can be applied in a zero-shot manner on languages not seen during training with minimal performance drop.
These findings are important for applying entity insertion models in practice, e.g., to support editors in adding links across the more than 300 language versions of Wikipedia.
arXiv Detail & Related papers (2024-10-05T18:22:15Z)
- SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding [55.48936731641802]
We present the SRFUND, a hierarchically structured multi-task form understanding benchmark.
SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets.
The dataset includes eight languages including English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese.
arXiv Detail & Related papers (2024-06-13T02:35:55Z)
- Cross-lingual Named Entity Corpus for Slavic Languages [1.8693484642696736]
This work is the result of a series of shared tasks, conducted in 2017-2023 as a part of the Workshops on Slavic Natural Language Processing.
The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities.
arXiv Detail & Related papers (2024-03-30T22:20:08Z)
- Towards a Brazilian History Knowledge Graph [50.26735825937335]
We construct a knowledge graph for Brazilian history based on the Brazilian Dictionary of Historical Biographies (DHBB) and Wikipedia/Wikidata.
We show that many terms/entities described in the DHBB do not have corresponding concepts (or Q items) in Wikidata.
arXiv Detail & Related papers (2024-03-28T22:05:32Z)
- Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia [0.0]
We present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named-entities.
We describe in detail the developed procedure necessary to create this type of dataset in any language available on Wikipedia with DBpedia information.
arXiv Detail & Related papers (2022-12-14T11:38:48Z)
- The Fellowship of the Authors: Disambiguating Names from Social Network Context [2.3605348648054454]
Authority lists with extensive textual descriptions for each entity are lacking, and named entities are ambiguous.
We combine BERT-based mention representations with a variety of graph induction strategies and experiment with supervised and unsupervised cluster inference methods.
We find that in-domain language model pretraining can significantly improve mention representations, especially for larger corpora.
arXiv Detail & Related papers (2022-08-31T21:51:55Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.