The Frankfurt Latin Lexicon: From Morphological Expansion and Word
Embeddings to SemioGraphs
- URL: http://arxiv.org/abs/2005.10790v1
- Date: Thu, 21 May 2020 17:16:53 GMT
- Authors: Alexander Mehler, Bernhard Jussen, Tim Geelhaar, Alexander Henlein,
Giuseppe Abrami, Daniel Baumartz, Tolga Uslu, Wahed Hemati
- Abstract summary: The article argues for a more comprehensive understanding of lemmatization, encompassing classical machine learning as well as intellectual post-corrections and, in particular, human interpretation processes based on graph representations of the underlying lexical resources.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this article we present the Frankfurt Latin Lexicon (FLL), a lexical
resource for Medieval Latin that is used both for the lemmatization of Latin
texts and for the post-editing of lemmatizations. We describe recent advances
in the development of lemmatizers and test them against the Capitularies corpus
(comprising Frankish royal edicts, mid-6th to mid-9th century), a corpus
created as a reference for processing Medieval Latin. We also consider the
post-correction of lemmatizations using a limited crowdsourcing process aimed
at continuous review and updating of the FLL. Starting from the texts resulting
from this lemmatization process, we describe the extension of the FLL by means
of word embeddings, whose interactive traversing by means of SemioGraphs
completes the digitally enhanced hermeneutic circle. In this way, the article
argues for a more comprehensive understanding of lemmatization, encompassing
classical machine learning as well as intellectual post-corrections and, in
particular, human computation in the form of interpretation processes based on
graph representations of the underlying lexical resources.
Related papers
- Comparative Analysis of Static and Contextual Embeddings for Analyzing Semantic Changes in Medieval Latin Charters [6.883666189245419]
This paper presents the first computational analysis of semantic change pre- and post-Norman Conquest.
It is the first systematic comparison of static and contextual embeddings in a scarce historical data set.
Our findings confirm that, consistent with existing studies, contextual embeddings outperform static word embeddings in capturing semantic change.
arXiv Detail & Related papers (2024-10-11T22:19:17Z)
- eFontes. Part of Speech Tagging and Lemmatization of Medieval Latin Texts. A Cross-Genre Survey [41.94295877935867]
The eFontes models were trained on Universal Dependencies (UD) corpora and the newly developed eFontes corpus of Polish Medieval Latin.
The models achieved high accuracy rates: lemmatization at 92.60%, part-of-speech tagging at 83.29%, and morphological feature determination at 88.57%.
arXiv Detail & Related papers (2024-06-29T11:59:20Z)
- LiMe: a Latin Corpus of Late Medieval Criminal Sentences [39.26357402982764]
We present the LiMe dataset, a corpus of 325 documents extracted from a series of medieval manuscripts called Libri sententiarum potestatis Mediolani.
arXiv Detail & Related papers (2024-04-19T12:06:28Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- FRACAS: A FRench Annotated Corpus of Attribution relations in newS [0.0]
We present a manually annotated corpus of 1676 newswire texts in French for quotation extraction and source attribution.
We first describe the composition of our corpus and the choices that were made in selecting the data.
We then detail the inter-annotator agreement among the 8 annotators who performed the manual labelling.
arXiv Detail & Related papers (2023-09-19T13:19:54Z)
- Taxonomy Enrichment with Text and Graph Vector Representations [61.814256012166794]
We address the problem of taxonomy enrichment which aims at adding new words to the existing taxonomy.
We present a new method that achieves strong results on this task with little effort.
We achieve state-of-the-art results across different datasets and provide an in-depth error analysis of mistakes.
arXiv Detail & Related papers (2022-01-21T09:01:12Z)
- LexSubCon: Integrating Knowledge from Lexical Resources into Contextual
Embeddings for Lexical Substitution [76.615287796753]
We introduce LexSubCon, an end-to-end lexical substitution framework based on contextual embedding models.
This is achieved by combining contextual information with knowledge from structured lexical resources.
Our experiments show that LexSubCon outperforms previous state-of-the-art methods on LS07 and CoInCo benchmark datasets.
arXiv Detail & Related papers (2021-07-11T21:25:56Z)
- Lexical semantic change for Ancient Greek and Latin [61.69697586178796]
Associating a word's correct meaning in its historical context is a central challenge in diachronic research.
We build on a recent computational approach to semantic change based on a dynamic Bayesian mixture model.
We provide a systematic comparison of dynamic Bayesian mixture models for semantic change with state-of-the-art embedding-based models.
arXiv Detail & Related papers (2021-01-22T12:04:08Z)
- Latin BERT: A Contextual Language Model for Classical Philology [7.513100214864645]
We present Latin BERT, a contextual language model for the Latin language.
It was trained on 642.7 million words from a variety of sources spanning the Classical era to the 21st century.
arXiv Detail & Related papers (2020-09-21T17:47:44Z)
- MedLatinEpi and MedLatinLit: Two Datasets for the Computational
Authorship Analysis of Medieval Latin Texts [72.16295267480838]
We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis.
MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects.
arXiv Detail & Related papers (2020-06-22T14:22:47Z)