Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches
- URL: http://arxiv.org/abs/2107.03450v1
- Date: Wed, 7 Jul 2021 19:23:22 GMT
- Title: Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches
- Authors: Jean-Baptiste Camps, Chahan Vidal-Gorène, and Marguerite Vernet
- Abstract summary: Abbreviations present particular challenges to computational approaches such as handwritten text recognition and natural language processing tasks.
We explore different setups to obtain such a normalised text, either directly, by training HTR engines on normalised (i.e., expanded, disabbreviated) text, or by decomposing the process into discrete steps, each making use of specialist models for recognition, word segmentation and normalisation.
Case studies are drawn from the medieval Latin tradition.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Although abbreviations are fairly common in handwritten sources, particularly
in medieval and modern Western manuscripts, previous research dealing with
computational approaches to their expansion is scarce. Yet abbreviations
present particular challenges to computational approaches such as handwritten
text recognition and natural language processing tasks. Often, pre-processing
ultimately aims to lead from a digitised image of the source to a normalised
text, which includes expansion of the abbreviations. We explore different
setups to obtain such a normalised text, either directly, by training HTR
engines on normalised (i.e., expanded, disabbreviated) text, or by decomposing
the process into discrete steps, each making use of specialist models for
recognition, word segmentation and normalisation. The case studies considered
here are drawn from the medieval Latin tradition.
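As a rough illustration of the decomposed setup described in the abstract, the sketch below chains stand-in components for recognition, word segmentation and abbreviation expansion; the stub functions and the toy abbreviation table are hypothetical placeholders, not the models trained in the paper.

```python
# A minimal sketch of the modular decomposition: each stage is a stand-in
# callable, not the authors' actual HTR/NLP models.

from typing import List

def recognise(line_image) -> str:
    """Stand-in HTR step: would map a line image to a diplomatic transcription,
    abbreviations and all (e.g. 'dns' for 'dominus')."""
    return "ihc dns nr"           # hypothetical output kept for illustration

def segment(diplomatic: str) -> List[str]:
    """Stand-in word segmentation step: scribes often omit or shift word
    boundaries, so this would be a learned model rather than str.split."""
    return diplomatic.split()

ABBREVIATIONS = {                 # toy lookup table; a real system would use a
    "ihc": "ihesus",              # trained normalisation model instead
    "dns": "dominus",
    "nr": "noster",
}

def normalise(tokens: List[str]) -> List[str]:
    """Stand-in abbreviation expansion / normalisation step."""
    return [ABBREVIATIONS.get(tok, tok) for tok in tokens]

def pipeline(line_image) -> str:
    """Recognition -> word segmentation -> normalisation, as separate models."""
    return " ".join(normalise(segment(recognise(line_image))))

print(pipeline(line_image=None))  # -> 'ihesus dominus noster'
```

In the end-to-end alternative the abstract mentions, a single HTR engine would instead be trained to emit the expanded text directly from the line image.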
Related papers
- Historical German Text Normalization Using Type- and Token-Based Language Modeling [0.0]
This report proposes a normalization system for German literary texts from c. 1700-1900, trained on a parallel corpus.
The proposed system makes use of a machine learning approach using Transformer language models, combining an encoder-decoder model to normalize individual word types, and a pre-trained causal language model to adjust these normalizations within their context.
An extensive evaluation shows that the proposed system provides state-of-the-art accuracy, comparable with a much larger fully end-to-end sentence-based normalization system, fine-tuning a pre-trained Transformer large language model.
arXiv Detail & Related papers (2024-09-04T16:14:05Z)
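A minimal sketch of the two-stage idea summarised above, assuming a type-level model that proposes candidate modern spellings and a context model that scores them; the candidate table and scoring function below are toy stand-ins, not the report's trained Transformer components.

```python
# Toy illustration of the two-stage normalization idea: a type-level model
# proposes candidate modern spellings per historical word form, and a context
# model picks the one that fits the sentence best. Both components are stand-ins.

from typing import Dict, List

CANDIDATES: Dict[str, List[str]] = {     # hypothetical type-level proposals
    "seyn": ["sein"],
    "thun": ["tun"],
    "weise": ["weise", "Weise"],         # ambiguous type: needs context
}

def context_score(left_context: List[str], candidate: str) -> float:
    """Stand-in for a causal LM score P(candidate | left context)."""
    if left_context and left_context[-1] == "die" and candidate[0].isupper():
        return 0.9                       # after an article, prefer the noun reading
    return 0.5 if candidate.islower() else 0.1

def normalise(tokens: List[str]) -> List[str]:
    out: List[str] = []
    for tok in tokens:
        cands = CANDIDATES.get(tok, [tok])
        out.append(max(cands, key=lambda c: context_score(out, c)))
    return out

print(normalise(["die", "weise", "mag", "thun"]))  # -> ['die', 'Weise', 'mag', 'tun']
```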
- Fine-grained Controllable Text Generation through In-context Learning with Feedback [57.396980277089135]
We present a method for rewriting an input sentence to match specific values of nontrivial linguistic features, such as dependency depth.
In contrast to earlier work, our method uses in-context learning rather than finetuning, making it applicable in use cases where data is sparse.
arXiv Detail & Related papers (2024-06-17T08:55:48Z)
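As a hedged illustration of the in-context learning setup, the sketch below assembles a few-shot prompt asking a model to rewrite a sentence to a target dependency depth; the exemplars, wording and depth values are hypothetical, not the paper's actual prompts.

```python
# Hypothetical few-shot prompt construction for feature-controlled rewriting.
# The exemplars, feature values and instruction wording are illustrative only.

from typing import List, Tuple

def build_prompt(exemplars: List[Tuple[str, int, str]],
                 sentence: str, target_depth: int) -> str:
    """Assemble an in-context learning prompt asking a model to rewrite
    `sentence` so that its dependency depth equals `target_depth`."""
    parts = ["Rewrite each sentence so its dependency tree has the requested depth.\n"]
    for src, depth, tgt in exemplars:
        parts.append(f"Sentence: {src}\nTarget depth: {depth}\nRewrite: {tgt}\n")
    parts.append(f"Sentence: {sentence}\nTarget depth: {target_depth}\nRewrite:")
    return "\n".join(parts)

demo = [
    ("The cat sat on the mat.", 2, "The cat sat."),
    ("She left.", 4, "She left because the rain that had started would not stop."),
]
print(build_prompt(demo, "The scribe copied the manuscript.", 3))
# The resulting string would be sent to an instruction-following LM; its output
# can then be checked with a parser and re-prompted (feedback) if the depth is off.
```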
- Neural machine translation for automated feedback on children's early-stage writing [3.0695550123017514]
We address the problem of assessing and constructing feedback for early-stage writing automatically using machine learning.
We propose to use sequence-to-sequence models for "translating" early-stage writing by students into "conventional" writing.
arXiv Detail & Related papers (2023-11-15T21:32:44Z)
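A minimal sketch of the "translation" framing, assuming a generic off-the-shelf encoder-decoder from the transformers library; the checkpoint name and the example sentence pair are placeholders rather than the authors' data or model.

```python
# Sketch of "translating" early-stage writing into conventional writing with a
# generic encoder-decoder; checkpoint and example pair are placeholders.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

source = ["i haved a dog and he runned fast"]      # hypothetical early-stage writing
target = ["I had a dog and he ran fast."]          # conventional rendering

enc = tok(source, return_tensors="pt", padding=True)
lab = tok(target, return_tensors="pt", padding=True)

# One fine-tuning step: the seq2seq loss pushes the model to map the noisy input
# to the conventional target; generation later yields the text used for feedback.
out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask,
            labels=lab.input_ids)
out.loss.backward()
```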
- A Study of Augmentation Methods for Handwritten Stenography Recognition [0.0]
We study 22 classical augmentation techniques, most of which are commonly used for HTR of other scripts.
We identify a group of augmentations, including for example contained ranges of random rotation, shifts and scaling, that are beneficial to the use case of stenography recognition.
arXiv Detail & Related papers (2023-03-05T20:06:19Z)
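The sketch below illustrates the kind of geometric augmentations mentioned (small random rotations, shifts and scaling within narrow ranges), here with Pillow; the parameter ranges are placeholders, not the beneficial ranges identified by the study.

```python
# Illustrative geometric augmentations of a text-line image with Pillow; the
# parameter ranges below are placeholders, not the ones identified in the study.

import random
from PIL import Image

def augment(line: Image.Image) -> Image.Image:
    # Small random rotation (degrees), kept narrow so strokes stay legible.
    angle = random.uniform(-2.0, 2.0)
    out = line.rotate(angle, expand=True, fillcolor=255)

    # Small random shift, implemented by pasting onto a slightly larger canvas.
    dx, dy = random.randint(-4, 4), random.randint(-2, 2)
    canvas = Image.new(out.mode, (out.width + 8, out.height + 4), color=255)
    canvas.paste(out, (4 + dx, 2 + dy))

    # Small random scaling of the whole line.
    scale = random.uniform(0.9, 1.1)
    return canvas.resize((int(canvas.width * scale), int(canvas.height * scale)))

sample = Image.new("L", (400, 64), color=255)   # stand-in for a stenography line image
print(augment(sample).size)
```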
- Dealing with Abbreviations in the Slovenian Biographical Lexicon [2.0810096547938164]
Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors.
We propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text.
arXiv Detail & Related papers (2022-11-04T13:09:02Z)
- Context-Tuning: Learning Contextualized Prompts for Natural Language Generation [52.835877179365525]
We propose a novel continuous prompting approach, called Context-Tuning, for fine-tuning PLMs for natural language generation.
Firstly, the prompts are derived from the input text, so that they can elicit useful knowledge from PLMs for generation.
Secondly, to further enhance the relevance of the generated text to the inputs, we utilize continuous inverse prompting to refine the process of natural language generation.
arXiv Detail & Related papers (2022-01-21T12:35:28Z)
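A rough PyTorch sketch of input-derived continuous prompts: a small module maps a pooled representation of the input embeddings to k prompt vectors that are prepended to the sequence; the dimensions and pooling choice are assumptions, and the continuous inverse prompting step is not shown.

```python
# Sketch of input-dependent ("contextualized") continuous prompts in PyTorch:
# a small module maps a pooled input representation to k prompt vectors, which
# are prepended to the input embeddings. Sizes and pooling are assumptions;
# the inverse-prompting refinement step is omitted.

import torch
import torch.nn as nn

class ContextPromptGenerator(nn.Module):
    def __init__(self, hidden: int = 768, k: int = 8):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(hidden, k * hidden)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden)
        pooled = input_embeds.mean(dim=1)                      # (batch, hidden)
        prompts = self.proj(pooled)                            # (batch, k * hidden)
        prompts = prompts.view(-1, self.k, input_embeds.size(-1))
        # Prepend the derived prompts; the result would be fed to a frozen PLM
        # through its `inputs_embeds` argument.
        return torch.cat([prompts, input_embeds], dim=1)

gen = ContextPromptGenerator()
x = torch.randn(2, 16, 768)          # stand-in for token embeddings of the input text
print(gen(x).shape)                  # torch.Size([2, 24, 768])
```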
- Digital Editions as Distant Supervision for Layout Analysis of Printed Books [76.29918490722902]
We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models.
In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics.
We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z)
- Latin writing styles analysis with Machine Learning: New approach to old questions [0.0]
In the Middle Ages, texts were learned by heart and passed on orally from generation to generation.
Taking into account this specific mode of composition of literature in Latin, we can search for and indicate probable patterns of familiar sources for specific narrative texts.
arXiv Detail & Related papers (2021-09-01T20:21:45Z)
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
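A toy rendering of the silver-label idea: word sequences that recur consistently within a single document are kept as phrase spans; the frequency threshold and length limits below are placeholders for the paper's actual mining criteria.

```python
# Toy version of mining silver phrase labels from sequences that co-occur
# consistently within a document; the frequency threshold and n-gram lengths
# are placeholders for the paper's actual mining criteria.

from collections import Counter
from typing import List, Set, Tuple

def silver_phrases(doc_sentences: List[List[str]], min_count: int = 3,
                   max_len: int = 4) -> Set[Tuple[str, ...]]:
    counts: Counter = Counter()
    for tokens in doc_sentences:
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Keep n-grams that recur often enough within this single document.
    return {gram for gram, c in counts.items() if c >= min_count}

doc = [
    "handwritten text recognition remains challenging".split(),
    "we evaluate handwritten text recognition systems".split(),
    "handwritten text recognition of medieval latin".split(),
]
print(silver_phrases(doc))
# e.g. {('handwritten', 'text'), ('text', 'recognition'), ('handwritten', 'text', 'recognition')}
```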
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
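For the CTC decoder family mentioned above, the standard inference step is greedy decoding that collapses repeated symbols and removes blanks; the sketch below implements that step on a made-up sequence of per-frame predictions.

```python
# Greedy CTC decoding (collapse repeated symbols, then drop blanks), the standard
# inference step for the CTC decoder family; the per-frame argmax sequence below
# is made up for illustration.

from typing import List

BLANK = 0

def ctc_greedy_decode(frame_argmax: List[int], alphabet: str) -> str:
    decoded: List[int] = []
    prev = None
    for sym in frame_argmax:
        if sym != prev and sym != BLANK:      # collapse repeats, skip blanks
            decoded.append(sym)
        prev = sym
    return "".join(alphabet[i - 1] for i in decoded)   # index 0 is reserved for blank

# Frames: d d <blank> n n <blank> s  ->  "dns"
alphabet = "abcdefghijklmnopqrstuvwxyz"
frames = [4, 4, 0, 14, 14, 0, 19]
print(ctc_greedy_decode(frames, alphabet))    # -> 'dns'
```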
- Enabling Language Models to Fill in the Blanks [81.59381915581892]
We present a simple approach for text infilling, the task of predicting missing spans of text at any position in a document.
We train (or fine-tune) off-the-shelf language models on sequences containing the concatenation of artificially-masked text and the text which was masked.
We show that this approach, which we call infilling by language modeling, can enable LMs to infill entire sentences effectively on three different domains: short stories, scientific abstracts, and lyrics.
arXiv Detail & Related papers (2020-05-11T18:00:03Z)
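A sketch of the training-sequence construction this summary describes: the artificially masked text is concatenated with the spans that were masked, so a standard language model can learn to infill; the special tokens and span-sampling choices below are placeholders.

```python
# Sketch of the "infilling by language modeling" training format: the input is
# the masked text concatenated with the masked-out spans, so an ordinary LM can
# be trained on it. The special tokens below are placeholders.

import random
from typing import List

def make_infilling_example(tokens: List[str], n_blanks: int = 1) -> str:
    tokens = list(tokens)
    answers = []
    for _ in range(n_blanks):
        i = random.randrange(len(tokens))
        j = min(len(tokens), i + random.randint(1, 3))
        answers.append(" ".join(tokens[i:j]))
        tokens[i:j] = ["[blank]"]
    # Masked text, a separator, then each masked span terminated by an answer marker.
    return " ".join(tokens) + " [sep] " + " ".join(a + " [answer]" for a in answers)

random.seed(0)
print(make_infilling_example("the scribe expanded every abbreviation in the codex".split()))
# e.g. "the scribe expanded every [blank] in the codex [sep] abbreviation [answer]"
```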
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.