People and Places of Historical Europe: Bootstrapping Annotation
Pipeline and a New Corpus of Named Entities in Late Medieval Texts
- URL: http://arxiv.org/abs/2305.16718v2
- Date: Tue, 6 Jun 2023 20:42:10 GMT
- Title: People and Places of Historical Europe: Bootstrapping Annotation
Pipeline and a New Corpus of Named Entities in Late Medieval Texts
- Authors: Vít Novotný, Kristýna Luger, Michal Štefánik, Tereza
Vrabcová, Aleš Horák
- Abstract summary: We develop a new NER corpus of 3.6M sentences from late medieval charters written mainly in Czech, Latin, and German.
We show that we can start with a list of known historical figures and locations and an unannotated corpus of historical texts, and use information retrieval techniques to automatically bootstrap a NER-annotated corpus.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although pre-trained named entity recognition (NER) models are highly
accurate on modern corpora, they underperform on historical texts due to
differences in language and OCR errors. In this work, we develop a new NER corpus
of 3.6M sentences from late medieval charters written mainly in Czech, Latin,
and German.
We show that we can start with a list of known historical figures and
locations and an unannotated corpus of historical texts, and use information
retrieval techniques to automatically bootstrap a NER-annotated corpus. Using
our corpus, we train a NER model that achieves entity-level Precision of
72.81-93.98% with 58.14-81.77% Recall on a manually-annotated test dataset.
Furthermore, we show that using a weighted loss function helps to combat class
imbalance in token classification tasks. To make it easy for others to
reproduce and build upon our work, we publicly release our corpus, models, and
experimental code.
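The abstract's point about a weighted loss for imbalanced token classification can be illustrated with a minimal sketch. It is not the authors' implementation: the inverse-frequency weighting scheme, the function names, and the toy labels below are all assumptions made for illustration.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight each class by total / (num_classes * count)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true label) over all tokens.

    probs is a list of dicts mapping each class to its predicted probability.
    The sum is normalized by the total weight of the true labels.
    """
    num = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den


# Toy, hypothetical token labels: most tokens are outside any entity ("O").
labels = ["O"] * 8 + ["B-PER", "B-LOC"]
weights = inverse_frequency_weights(labels)

# A lazy model that predicts "O" with high probability for every token.
lazy = [{"O": 0.9, "B-PER": 0.05, "B-LOC": 0.05} for _ in labels]

uniform = {c: 1.0 for c in weights}
plain_loss = weighted_cross_entropy(lazy, labels, uniform)
weighted_loss = weighted_cross_entropy(lazy, labels, weights)
print(f"unweighted: {plain_loss:.4f}, weighted: {weighted_loss:.4f}")
```

With uniform weights, the lazy model is penalized only mildly for missing the two entity tokens. With the assumed inverse-frequency weights, the rare entity classes dominate the loss, so ignoring them becomes costly.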
Related papers
- LexMatcher: Dictionary-centric Data Collection for LLM-based Machine Translation [67.24113079928668]
We present LexMatcher, a method for data curation driven by the coverage of senses found in bilingual dictionaries.
Our approach outperforms the established baselines on the WMT2022 test sets.
arXiv Detail & Related papers (2024-06-03T15:30:36Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Transfer Learning across Several Centuries: Machine and Historian Integrated Method to Decipher Royal Secretary's Diary [1.105375732595832]
NER in historical text faces challenges such as the scarcity of annotated corpora, multilingual variety, various kinds of noise, and conventions that differ greatly from those of contemporary language models.
This paper introduces a Korean historical corpus (the Diary of the Royal Secretary, named SeungJeongWon), recorded over several centuries and recently augmented with named entity information and phrase markers carefully annotated by historians.
arXiv Detail & Related papers (2023-06-26T11:00:35Z)
- SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes [93.19166902594168]
We propose SESCORE2, a self-supervised approach for training a model-based metric for text generation evaluation.
The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus.
We evaluate SESCORE2 and previous methods on four text generation tasks across three languages.
arXiv Detail & Related papers (2022-12-19T09:02:16Z)
- PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
- hmBERT: Historical Multilingual Language Models for Named Entity Recognition [0.6226609932118123]
We tackle NER for identifying persons, locations, and organizations in historical texts.
In this work, we tackle NER for historical German, English, French, Swedish, and Finnish by training large historical language models.
arXiv Detail & Related papers (2022-05-31T07:30:33Z)
- From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French [57.886210204774834]
We present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries).
We present the FreEM_max corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEM_max.
arXiv Detail & Related papers (2022-02-18T22:17:22Z)
- Scarecrow: A Framework for Scrutinizing Machine Text [69.26985439191151]
We introduce a new structured, crowdsourced error annotation schema called Scarecrow.
Scarecrow collects 13k annotations of 1.3k human- and machine-generated paragraphs of English language news text.
These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems.
arXiv Detail & Related papers (2021-07-02T22:37:03Z)
- Cross-context News Corpus for Protest Events related Knowledge Base Construction [0.15393457051344295]
We describe a gold standard corpus of protest events that comprises various local and international sources in English.
This corpus facilitates creating machine learning models that automatically classify news articles and extract protest event-related information.
arXiv Detail & Related papers (2020-08-01T22:20:48Z)
- Automatically Ranked Russian Paraphrase Corpus for Text Generation [0.0]
The article is focused on automatic development and ranking of a large corpus for Russian paraphrase generation.
Existing manually annotated paraphrase datasets for Russian are limited to the small-sized ParaPhraser corpus and ParaPlag.
arXiv Detail & Related papers (2020-06-17T08:40:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.