Stylometry for Noisy Medieval Data: Evaluating Paul Meyer's Hagiographic
Hypothesis
- URL: http://arxiv.org/abs/2012.03845v1
- Date: Mon, 7 Dec 2020 16:48:34 GMT
- Title: Stylometry for Noisy Medieval Data: Evaluating Paul Meyer's Hagiographic
Hypothesis
- Authors: Jean-Baptiste Camps, Thibault Clérice, Ariane Pinche
- Abstract summary: We use a workflow combining handwritten text recognition and stylometric analysis, applied to the case of the hagiographic works contained in MS BnF, fr. 412.
We seek to evaluate Paul Meyer's hypothesis about the constitution of groups of hagiographic works, as well as to examine potential authorial groupings in a vastly anonymous corpus.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stylometric analysis of medieval vernacular texts is still a significant
challenge: the importance of scribal variation, be it spelling or more
substantial, as well as the variants and errors introduced in the tradition,
complicate the task of the would-be stylometrist. Basing the analysis on the
study of the copy from a single hand of several texts can partially mitigate
these issues (Camps and Cafiero, 2013), but the limited availability of
complete diplomatic transcriptions might make this difficult. In this paper, we
use a workflow combining handwritten text recognition and stylometric analysis,
applied to the case of the hagiographic works contained in MS BnF, fr. 412. We
seek to evaluate Paul Meyer's hypothesis about the constitution of groups of
hagiographic works, as well as to examine potential authorial groupings in a
vastly anonymous corpus.
Related papers
- StegoStylo: Squelching Stylometric Scrutiny through Steganographic Stitching [0.0]
Stylometry supports copyright and plagiarism investigations, aids detection of harmful content, and provides historical context for literary works. Although stylometry is employed as a tool for authorship verification (confirming whether a text truly originates from a claimed author), it can also be weaponized for malicious purposes. This paper explores how adversarial stylometry combined with steganography can counteract stylometric analysis.
arXiv Detail & Related papers (2026-01-14T00:49:20Z) - Making Characters Count. A Computational Approach to Scribal Profiling in 14th-Century Middle Dutch Manuscripts from the Carthusian Monastery of Herne [0.0]
The Carthusian monastery of Herne was exceptionally prolific in producing high-quality manuscripts during the late 14th century. Previous research has distinguished thirteen different scribal hands based on paleography and codicology. We revisit this hypothesis through the lens of linguistic characteristics of the texts, using computational methods from the field of scribal profiling.
arXiv Detail & Related papers (2025-08-26T08:20:40Z) - The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure [91.01653854955286]
Embedding-based similarity metrics can be influenced by spurious attributes like the text's source or language. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost.
arXiv Detail & Related papers (2025-07-01T23:17:12Z) - A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution [57.309390098903]
Authorship attribution aims to identify the origin or author of a document.
Large Language Models (LLMs) with their deep reasoning capabilities and ability to maintain long-range textual associations offer a promising alternative.
Our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors.
arXiv Detail & Related papers (2024-10-29T04:14:23Z) - Says Who? Effective Zero-Shot Annotation of Focalization [0.0]
Focalization, the perspective through which narrative is presented, is encoded via a wide range of lexico-grammatical features.
Even trained annotators frequently disagree on correct labels, suggesting this task is both qualitatively and computationally challenging.
Despite the challenging nature of the task, we find that LLMs show comparable performance to trained human annotators, with GPT-4o achieving an average F1 of 84.79%.
arXiv Detail & Related papers (2024-09-17T17:50:15Z) - STRICTA: Structured Reasoning in Critical Text Assessment for Peer Review and Beyond [68.47402386668846]
We introduce Structured Reasoning In Critical Text Assessment (STRICTA) to model text assessment as an explicit, step-wise reasoning process. STRICTA breaks down the assessment into a graph of interconnected reasoning steps drawing on causality theory. We apply STRICTA to a dataset of over 4000 reasoning steps from roughly 40 biomedical experts on more than 20 papers.
arXiv Detail & Related papers (2024-09-09T06:55:37Z) - Paired Completion: Flexible Quantification of Issue-framing at Scale with LLMs [0.41436032949434404]
We develop and rigorously evaluate new detection methods for issue framing and narrative analysis within large text datasets.
We show that issue framing can be reliably and efficiently detected in large corpora with only a few examples of either perspective on a given issue.
arXiv Detail & Related papers (2024-08-19T07:14:15Z) - Impact of Ground Truth Quality on Handwriting Recognition [0.5328877196581558]
The Bullinger database contains over a hundred thousand labeled text line images of mostly premodern German and Latin texts.
In this paper, we investigate the impact of such errors on training and evaluation and suggest means to detect and correct typical alignment errors.
arXiv Detail & Related papers (2023-12-14T15:36:41Z) - The Learnable Typewriter: A Generative Approach to Text Analysis [17.355857281085164]
We present a generative document-specific approach to character analysis and recognition in text lines.
Taking as input a set of text lines with similar font or handwriting, our approach can learn a large number of different characters.
arXiv Detail & Related papers (2023-02-03T11:17:59Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - TFW2V: An Enhanced Document Similarity Method for the Morphologically Rich Finnish Language [0.5801044612920816]
This study focuses on experimenting with some of the current approaches to Finnish, which is a morphologically rich language.
We propose a simple method, TFW2V, which shows high efficiency in handling both long text documents and limited amounts of data.
arXiv Detail & Related papers (2021-12-23T12:27:45Z) - Image Collation: Matching illustrations in manuscripts [76.21388548732284]
We introduce the task of illustration collation and a large annotated public dataset to evaluate solutions.
We analyze state of the art similarity measures for this task and show that they succeed in simple cases but struggle for large manuscripts.
We show clear evidence that significant performance boosts can be expected by exploiting cycle-consistent correspondences.
arXiv Detail & Related papers (2021-08-18T12:12:14Z) - Toward the Understanding of Deep Text Matching Models for Information Retrieval [72.72380690535766]
This paper aims at testing whether existing deep text matching methods satisfy some fundamental constraints in information retrieval.
Specifically, four constraints are used in our study, i.e., term frequency constraint, term discrimination constraint, length normalization constraints, and TF-length constraint.
Experimental results on LETOR 4.0 and MS Marco show that all the investigated deep text matching methods satisfy the above constraints with high probabilities in statistics.
arXiv Detail & Related papers (2021-08-16T13:33:15Z) - Pareto Probing: Trading Off Accuracy for Complexity [87.09294772742737]
We argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance.
Our experiments with dependency parsing reveal a wide gap in syntactic knowledge between contextual and non-contextual representations.
arXiv Detail & Related papers (2020-10-05T17:27:31Z) - Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts [0.15833270109954134]
A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content.
We introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts.
We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback-Leibler and Jensen-Shannon divergences.
arXiv Detail & Related papers (2020-08-05T17:27:11Z) - A computational model implementing subjectivity with the 'Room Theory'. The case of detecting Emotion from Text [68.8204255655161]
This work introduces a new method to consider subjectivity and general context dependency in text analysis.
By using similarity measure between words, we are able to extract the relative relevance of the elements in the benchmark.
This method could be applied to all the cases where evaluating subjectivity is relevant to understand the relative value or meaning of a text.
arXiv Detail & Related papers (2020-05-12T21:26:04Z)