Automatic Identification of Types of Alterations in Historical
Manuscripts
- URL: http://arxiv.org/abs/2003.09136v3
- Date: Wed, 4 Nov 2020 15:36:16 GMT
- Title: Automatic Identification of Types of Alterations in Historical
Manuscripts
- Authors: David Lassner (TUB), Anne Baillot (3L.AM), Sergej Dogadov (TUB),
Klaus-Robert Müller (TUB), Shinichi Nakajima (TUB)
- Abstract summary: We present a machine learning-based approach to help categorize alterations in documents.
In particular, we present a new probabilistic model that categorizes content-related alterations.
On unlabelled data, applying alterLDA leads to interesting new insights into the alteration behavior of authors, editors and other manuscript contributors.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Alterations in historical manuscripts such as letters represent a promising
field of research. On the one hand, they help understand the construction of
text. On the other hand, topics considered sensitive at the time of writing
gain coherence and contextuality when alterations, especially deletions, are
taken into account. The analysis of alterations in manuscripts, though, is
traditionally very tedious work. In this paper, we
present a machine learning-based approach to help categorize alterations in
documents. In particular, we present a new probabilistic model (Alteration
Latent Dirichlet Allocation, alterLDA in the following) that categorizes
content-related alterations. The method proposed here is developed based on
experiments carried out on the digital scholarly edition Berlin Intellectuals,
for which alterLDA achieves high performance in the recognition of alterations
on labelled data. On unlabelled data, applying alterLDA leads to interesting
new insights into the alteration behavior of authors, editors and other
manuscript contributors, as well as insights into sensitive topics in the
correspondence of Berlin intellectuals around 1800. In addition to the findings
based on the digital scholarly edition Berlin Intellectuals, we present a
general framework for the analysis of text genesis that can be used in the
context of other digital resources representing document variants. To that end,
we describe in detail the methodological steps to be followed in order to
achieve such results, thereby giving a prime example of a Machine Learning
application in the Digital Humanities.
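The abstract situates alterLDA in the Latent Dirichlet Allocation family. As a conceptual baseline, the sketch below runs plain LDA with collapsed Gibbs sampling on a few toy "alteration" fragments; it is not the alterLDA model itself (whose alteration-specific variables are defined in the paper), and the fragment texts and hyperparameters are illustrative assumptions.

```python
# Plain-LDA sketch (collapsed Gibbs sampling) on toy alteration fragments.
# alterLDA extends this family with alteration-specific structure; see the paper.
import random

# Hypothetical tokenized fragments standing in for manuscript passages.
docs = [
    "politics censorship deleted opinion".split(),
    "family greeting health news".split(),
    "politics opinion editor deleted".split(),
    "family visit news berlin".split(),
]
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}
K, alpha, beta, V = 2, 0.1, 0.01, len(vocab)

random.seed(0)
z = [[random.randrange(K) for _ in d] for d in docs]  # topic assignments
ndk = [[0] * K for _ in docs]                         # doc-topic counts
nkw = [[0] * V for _ in range(K)]                     # topic-word counts
nk = [0] * K
for d, doc in enumerate(docs):
    for n, w in enumerate(doc):
        k = z[d][n]
        ndk[d][k] += 1; nkw[k][w2i[w]] += 1; nk[k] += 1

for _ in range(200):  # Gibbs sweeps
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d][k] -= 1; nkw[k][w2i[w]] -= 1; nk[k] -= 1
            # Sample topic proportional to (n_dk + alpha)*(n_kw + beta)/(n_k + V*beta)
            probs = [(ndk[d][j] + alpha) * (nkw[j][w2i[w]] + beta)
                     / (nk[j] + V * beta) for j in range(K)]
            r = random.random() * sum(probs)
            for j, p in enumerate(probs):
                r -= p
                if r <= 0:
                    k = j
                    break
            z[d][n] = k
            ndk[d][k] += 1; nkw[k][w2i[w]] += 1; nk[k] += 1

# Per-fragment topic proportions (each row is a distribution over K topics).
theta = [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
         for d in range(len(docs))]
print(theta)
```

With topics inferred this way, one could inspect which latent topic dominates deleted versus retained passages, which is the kind of question alterLDA formalizes directly.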
Related papers
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - CASIMIR: A Corpus of Scientific Articles enhanced with Multiple Author-Integrated Revisions [7.503795054002406]
We propose an original textual resource on the revision step of the writing process of scientific articles.
This new dataset, called CASIMIR, contains the multiple revised versions of 15,646 scientific articles from OpenReview, along with their peer reviews.
arXiv Detail & Related papers (2024-03-01T03:07:32Z) - A Literature Review of Literature Reviews in Pattern Analysis and Machine Intelligence [58.6354685593418]
This paper proposes several article-level, field-normalized, and large language model-empowered bibliometric indicators to evaluate reviews.
The newly emerging AI-generated literature reviews are also appraised.
This work offers insights into the current challenges of literature reviews and envisions future directions for their development.
arXiv Detail & Related papers (2024-02-20T11:28:50Z) - Stylometry Analysis of Multi-authored Documents for Authorship and
Author Style Change Detection [2.117778717665161]
This paper investigates three key tasks of style analysis: (i) classification of single and multi-authored documents, (ii) single change detection, and (iii) multiple author-switching detection in multi-authored documents.
We propose a merit-based fusion framework that integrates several state-of-the-art natural language processing (NLP) algorithms and weight optimization techniques.
arXiv Detail & Related papers (2024-01-12T18:36:41Z) - Don't lose the message while paraphrasing: A study on content preserving
style transfer [61.38460184163704]
Content preservation is critical for real-world applications of style transfer studies.
We compare various style transfer models on the example of the formality transfer domain.
We conduct a precise comparative study of several state-of-the-art techniques for style transfer.
arXiv Detail & Related papers (2023-08-17T15:41:08Z) - To Revise or Not to Revise: Learning to Detect Improvable Claims for
Argumentative Writing Support [20.905660642919052]
We explore the main challenges to identifying argumentative claims in need of specific revisions.
We propose a new sampling strategy based on revision distance.
We provide evidence that using contextual information and domain knowledge can further improve prediction results.
arXiv Detail & Related papers (2023-05-26T10:19:54Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - Continuous Offline Handwriting Recognition using Deep Learning Models [0.0]
Handwritten text recognition is an open problem of great interest in the area of automatic document image analysis.
We propose a new recognition model based on integrating two types of deep learning architectures: convolutional neural networks (CNN) and sequence-to-sequence (seq2seq) models.
The newly proposed model achieves results competitive with other well-established methodologies.
arXiv Detail & Related papers (2021-12-26T07:31:03Z) - Digital Editions as Distant Supervision for Layout Analysis of Printed
Books [76.29918490722902]
We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models.
In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics.
We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z) - Compression, Transduction, and Creation: A Unified Framework for
Evaluating Natural Language Generation [85.32991360774447]
Natural language generation (NLG) spans a broad range of tasks, each of which serves for specific objectives.
We propose a unifying perspective based on the nature of information change in NLG tasks.
We develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks.
arXiv Detail & Related papers (2021-09-14T01:00:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.