CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality
- URL: http://arxiv.org/abs/2508.07904v1
- Date: Mon, 11 Aug 2025 12:18:41 GMT
- Title: CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality
- Authors: Marco Peer, Anna Scius-Bertrand, Andreas Fischer
- Abstract summary: Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and model output probabilities trained with the CTC loss. Our approach improves performance (e.g., by 1.1 percentage points CER with PyLaia) and increases alignment accuracy.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. In this work, we address annotation errors - particularly hyphenation issues - in the Bullinger correspondence, a large 16th-century letter collection. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and model output probabilities trained with the CTC loss. Our approach improves performance (e.g., by 1.1 percentage points CER with PyLaia) and increases alignment accuracy. Interestingly, we find that weaker models yield more accurate alignments, enabling an iterative training strategy. We release a new manually corrected subset of 100 pages from the Bullinger dataset, along with our code and benchmarks. Our approach can be applied iteratively to further improve the CER as well as the alignment quality for text recognition pipelines. Code and data are available via https://github.com/andreas-fischer-unifr/nntp.
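The alignment step described above can be illustrated with a minimal sketch (an assumption about the general technique, not the authors' released implementation): a Viterbi pass over the standard CTC state topology, which interleaves blanks between the transcription's characters and finds the most likely frame-to-character assignment under the model's output probabilities.

```python
import numpy as np

def ctc_viterbi_align(log_probs, labels, blank=0):
    """Most-likely frame-to-label alignment under a CTC model.

    log_probs: (T, C) per-frame log-probabilities from the recognizer.
    labels:    target character ids (without blanks).
    Returns a length-T list assigning each frame a label id (or blank).
    """
    T = log_probs.shape[0]
    # Extended target with blanks interleaved: ^ c1 ^ c2 ... cN ^
    ext = [blank]
    for c in labels:
        ext.extend([c, blank])
    S = len(ext)
    dp = np.full((T, S), -np.inf)      # best log-score ending in state s at frame t
    bp = np.zeros((T, S), dtype=int)   # backpointer: offset to the previous state
    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Allowed moves: stay, advance one state, or skip a blank
            # between two different characters (standard CTC topology).
            cands = [dp[t - 1, s]]
            if s >= 1:
                cands.append(dp[t - 1, s - 1])
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(dp[t - 1, s - 2])
            best = int(np.argmax(cands))
            dp[t, s] = cands[best] + log_probs[t, ext[s]]
            bp[t, s] = best
    # The path may end on the final character or the trailing blank.
    s = S - 1
    if S > 1 and dp[T - 1, S - 2] > dp[T - 1, S - 1]:
        s = S - 2
    path = []
    for t in range(T - 1, 0, -1):
        path.append(ext[s])
        s -= bp[t, s]
    path.append(ext[s])
    return path[::-1]
```

The frame spans assigned to each character give line-level segment boundaries, which is how such an alignment can be used to detect hyphenation and other annotation errors.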
Related papers
- Generative Annotation for ASR Named Entity Correction [22.96005224780927]
End-to-end automatic speech recognition systems often fail to transcribe domain-specific named entities. We propose a novel NEC method that utilizes speech sound features to retrieve candidate entities. We test our method using open-source and self-constructed test sets.
arXiv Detail & Related papers (2025-08-28T12:18:35Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices [8.77712061194924]
We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models.
Our algorithm performs grapheme-to-phoneme (G2P) conversion directly from wordpieces into phonemes, avoiding explicit word representations.
We achieved up to a 15.2% relative reduction in sentence error rate (SER) on a test set with contextually relevant entities.
arXiv Detail & Related papers (2024-09-24T21:42:25Z) - Data Generation for Post-OCR correction of Cyrillic handwriting [41.94295877935867]
This paper focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves.
Such an engine generates highly realistic handwritten text in any quantity, which we utilize to create a substantial dataset.
We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training.
arXiv Detail & Related papers (2023-11-27T15:01:26Z) - GEC-DePenD: Non-Autoregressive Grammatical Error Correction with Decoupled Permutation and Decoding [52.14832976759585]
Grammatical error correction (GEC) is an important NLP task that is usually solved with autoregressive sequence-to-sequence models.
We propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network.
We show that the resulting network improves over previously known non-autoregressive methods for GEC.
arXiv Detail & Related papers (2023-11-14T14:24:36Z) - Offline Detection of Misspelled Handwritten Words by Convolving Recognition Model Features with Text Labels [0.0]
We introduce the task of comparing a handwriting image to text.
Our model's classification head is trained entirely on synthetic data created using a state-of-the-art generative adversarial network.
Such massive performance gains can lead to significant productivity increases in applications utilizing human-in-the-loop automation.
arXiv Detail & Related papers (2023-09-18T21:13:42Z) - Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - Noise-Robust Dense Retrieval via Contrastive Alignment Post Training [89.29256833403167]
Contrastive Alignment POst Training (CAPOT) is a highly efficient finetuning method that improves model robustness without requiring index regeneration.
CAPOT enables robust retrieval by freezing the document encoder while the query encoder learns to align noisy queries with their unaltered root.
We evaluate CAPOT on noisy variants of MSMARCO, Natural Questions, and TriviaQA passage retrieval, finding CAPOT has a similar impact as data augmentation with none of its overhead.
arXiv Detail & Related papers (2023-04-06T22:16:53Z) - Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation [80.12316877964558]
High-quality data labeling from specific domains is costly and time-consuming for human annotators.
We propose a self-supervised domain adaptation method, based upon an iterative pseudo-forced alignment algorithm.
arXiv Detail & Related papers (2022-10-27T07:23:08Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time stamping method can get less than 50ms word timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - Boosting offline handwritten text recognition in historical documents with few labeled lines [5.9207487081080705]
We analyze how to perform transfer learning from a massive database to a smaller historical database.
Second, we analyze methods to efficiently combine TL and data augmentation.
An algorithm to mitigate the effects of incorrect labelings in the training set is proposed.
arXiv Detail & Related papers (2020-12-04T11:59:35Z) - OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold [6.09170287691728]
We take a step from segmentation-free single line recognition towards segmentation-free multi-line / full page recognition.
We propose a novel and simple neural network module, termed OrigamiNet, that can augment any CTC-trained, fully convolutional single line text recognizer.
We achieve state-of-the-art character error rate on both IAM & ICDAR 2017 HTR benchmarks for handwriting recognition, surpassing all other methods in the literature.
arXiv Detail & Related papers (2020-06-12T22:18:02Z)
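Character error rate (CER), the metric reported by several of the papers above, is the character-level Levenshtein distance between hypothesis and reference, normalized by the reference length. A self-contained sketch (a standard textbook formulation, not taken from any of the listed papers):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Single-row dynamic-programming table over the hypothesis.
    dist = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cur = min(
                dist[j] + 1,       # deletion
                dist[j - 1] + 1,   # insertion
                prev + (reference[i - 1] != hypothesis[j - 1]),  # substitution
            )
            prev, dist[j] = dist[j], cur
    return dist[n] / max(m, 1)
```

A "1.1 percentage point" improvement, as quoted in the main abstract, means this ratio (expressed as a percentage) drops by 1.1, e.g. from 6.0% to 4.9%.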
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.