Related papers: Digital Editions as Distant Supervision for Layout Analysis of Printed Books

Digital Editions as Distant Supervision for Layout Analysis of Printed Books

URL: http://arxiv.org/abs/2112.12703v1
Date: Thu, 23 Dec 2021 16:51:53 GMT
Title: Digital Editions as Distant Supervision for Layout Analysis of Printed Books
Authors: Alejandro H. Toselli, Si Wu and David A. Smith
Abstract summary: We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
Score: 76.29918490722902
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Archivists, textual scholars, and historians often produce digital editions of historical documents. Using markup schemes such as those of the Text Encoding Initiative and EpiDoc, these digital editions often record documents' semantic regions (such as notes and figures) and physical features (such as page and line breaks) as well as transcribing their textual content. We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models. In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics. We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.

Related papers

Structure-Aware Text Recognition for Ancient Greek Critical Editions [16.43811675687955]
This paper investigates structure-aware text recognition for Ancient Greek critical editions.<n>We introduce a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation.<n>We evaluate three state-of-the-art visual language models under both zero-shot and fine-tuning regimes.
arXiv Detail & Related papers (2026-03-03T09:42:43Z)
FP-THD: Full page transcription of historical documents [0.0]
This work proposes a pipeline for the transcription of historical documents preserving special features.<n>We analyze historical text images using a layout analysis model to extract text lines, which are then processed by an OCR model to generate a fully digitized page.
arXiv Detail & Related papers (2026-01-20T07:13:38Z)
The OCR Quest for Generalization: Learning to recognize low-resource alphabets with model editing [2.7471068141502]
We aim for building models which can generalize to new distributions of data, such as alphabets, faster than centralized fine-tune strategies.<n>In contrast to state-of-the-art meta-learning, we showcase the effectiveness of domain merging in sparse distributions of data.<n>This research contributes a novel approach into building models that can easily adopt under-represented alphabets.
arXiv Detail & Related papers (2025-06-07T11:05:33Z)
Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes [80.20670062509723]
3D dense captioning is an emerging vision-language bridging task that aims to generate detailed descriptions for 3D scenes. It presents significant potential and challenges due to its closer representation of the real world compared to 2D visual captioning. Despite the popularity and success of existing methods, there is a lack of comprehensive surveys summarizing the advancements in this field.
arXiv Detail & Related papers (2024-03-12T10:04:08Z)
PHD: Pixel-Based Language Modeling of Historical Documents [55.75201940642297]
We propose a novel method for generating synthetic scans to resemble real historical documents. We pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period. We successfully apply our model to a historical QA task, highlighting its usefulness in this domain.
arXiv Detail & Related papers (2023-10-22T08:45:48Z)
Document AI: A Comparative Study of Transformer-Based, Graph-Based Models, and Convolutional Neural Networks For Document Layout Analysis [3.231170156689185]
Document AI aims to automatically analyze documents by leveraging natural language processing and computer vision techniques. One of the major tasks of Document AI is document layout analysis, which structures document pages by interpreting the content and spatial relationships of layout, image, and text.
arXiv Detail & Related papers (2023-08-29T16:58:03Z)
The Learnable Typewriter: A Generative Approach to Text Analysis [17.355857281085164]
We present a generative document-specific approach to character analysis and recognition in text lines. Taking as input a set of text lines with similar font or handwriting, our approach can learn a large number of different characters.
arXiv Detail & Related papers (2023-02-03T11:17:59Z)
PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. We propose a contrastively trained model fit to learn textbfauthorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
EditEval: An Instruction-Based Benchmark for Text Improvements [73.5918084416016]
This work presents EditEval: An instruction-based, benchmark and evaluation suite for automatic evaluation of editing capabilities. We evaluate several pre-trained models, which shows that InstructGPT and PEER perform the best, but that most baselines fall below the supervised SOTA. Our analysis shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models.
arXiv Detail & Related papers (2022-09-27T12:26:05Z)
Robust Text Line Detection in Historical Documents: Learning and Evaluation Methods [1.9938405188113029]
We present a study conducted using three state-of-the-art systems Doc-UFCN, dhSegment and ARU-Net. We show that it is possible to build generic models trained on a wide variety of historical document datasets that can correctly segment diverse unseen pages.
arXiv Detail & Related papers (2022-03-23T11:56:25Z)
Continuous Offline Handwriting Recognition using Deep Learning Models [0.0]
Handwritten text recognition is an open problem of great interest in the area of automatic document image analysis. We have proposed a new recognition model based on integrating two types of deep learning architectures: convolutional neural networks (CNN) and sequence-to-sequence (seq2seq) The new proposed model provides competitive results with those obtained with other well-established methodologies.
arXiv Detail & Related papers (2021-12-26T07:31:03Z)
Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework. To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network. To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
Handwriting Classification for the Analysis of Art-Historical Documents [6.918282834668529]
We focus on the analysis of handwriting in scanned documents from the art-historic archive of the WPI. We propose a handwriting classification model that labels extracted text fragments based on their visual structure.
arXiv Detail & Related papers (2020-11-04T13:06:46Z)
A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning. This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021. We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
arXiv Detail & Related papers (2020-08-02T00:09:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.