MedLatinEpi and MedLatinLit: Two Datasets for the Computational
Authorship Analysis of Medieval Latin Texts
- URL: http://arxiv.org/abs/2006.12289v2
- Date: Sat, 11 Sep 2021 16:20:40 GMT
- Title: MedLatinEpi and MedLatinLit: Two Datasets for the Computational
Authorship Analysis of Medieval Latin Texts
- Authors: Silvia Corbara, Alejandro Moreo, Fabrizio Sebastiani, Mirko Tavoni
- Abstract summary: We present and make available MedLatinEpi and MedLatinLit, two datasets of medieval Latin texts to be used in research on computational authorship analysis.
MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects.
- Score: 72.16295267480838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present and make available MedLatinEpi and MedLatinLit, two datasets of
medieval Latin texts to be used in research on computational authorship
analysis. MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts,
respectively, labelled by author; MedLatinEpi texts are of epistolary nature,
while MedLatinLit texts consist of literary comments and treatises about
various subjects. As such, these two datasets lend themselves to supporting
research in authorship analysis tasks, such as authorship attribution,
authorship verification, or same-author verification. Along with the datasets
we provide experimental results, obtained on these datasets, for the authorship
verification task, i.e., the task of predicting whether a text of unknown
authorship was written by a candidate author or not. We also make available the
source code of the authorship verification system we have used, thus allowing
our experiments to be reproduced, and to be used as baselines, by other
researchers. We also describe the application of the above authorship
verification system, using these datasets as training data, for investigating
the authorship of two medieval epistles whose authorship has been disputed by
scholars.
Related papers
- LiMe: a Latin Corpus of Late Medieval Criminal Sentences [39.26357402982764]
We present the LiMe dataset, a corpus of 325 documents extracted from a series of medieval manuscripts called Libri sententiarum potestatis Mediolani.
arXiv Detail & Related papers (2024-04-19T12:06:28Z) - The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z) - CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z) - Cracking Double-Blind Review: Authorship Attribution with Deep Learning [43.483063713471935]
We propose a transformer-based, neural-network architecture to attribute an anonymous manuscript to an author.
We leverage all research papers publicly available on arXiv amounting to over 2 million manuscripts.
Our method achieves an unprecedented authorship attribution accuracy, where up to 73% of papers are attributed correctly.
arXiv Detail & Related papers (2022-11-14T15:50:24Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn textbfauthorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - Syllabic Quantity Patterns as Rhythmic Features for Latin Authorship
Attribution [74.27826764855911]
We employ syllabic quantity as a base for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts.
Our experiments, carried out on three different datasets, using two different machine learning methods, show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
arXiv Detail & Related papers (2021-10-27T06:25:31Z) - Quasi Error-free Text Classification and Authorship Recognition in a
large Corpus of English Literature based on a Novel Feature Set [0.0]
We show that in the entire GLEC quasi error-free text classification and authorship recognition is possible with a method using the same set of five style and five content features.
Our data pave the way for many future computational and empirical studies of literature or experiments in reading psychology.
arXiv Detail & Related papers (2020-10-21T07:39:55Z) - The Frankfurt Latin Lexicon: From Morphological Expansion and Word
Embeddings to SemioGraphs [97.8648124629697]
The article argues for a more comprehensive understanding of lemmatization, encompassing classical machine learning as well as intellectual post-corrections and, in particular, human interpretation processes based on graph representations of the underlying lexical resources.
arXiv Detail & Related papers (2020-05-21T17:16:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.