Robust Text Line Detection in Historical Documents: Learning and
Evaluation Methods
- URL: http://arxiv.org/abs/2203.12346v1
- Date: Wed, 23 Mar 2022 11:56:25 GMT
- Title: Robust Text Line Detection in Historical Documents: Learning and
Evaluation Methods
- Authors: M\'elodie Boillet, Christopher Kermorvant, Thierry Paquet
- Abstract summary: We present a study conducted using three state-of-the-art systems Doc-UFCN, dhSegment and ARU-Net.
We show that it is possible to build generic models trained on a wide variety of historical document datasets that can correctly segment diverse unseen pages.
- Score: 1.9938405188113029
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text line segmentation is one of the key steps in historical document
understanding. It is challenging due to the variety of fonts, contents, writing
styles and the quality of documents that have degraded through the years.
In this paper, we address the limitations that currently prevent people from
building line segmentation models with a high generalization capacity. We
present a study conducted using three state-of-the-art systems Doc-UFCN,
dhSegment and ARU-Net and show that it is possible to build generic models
trained on a wide variety of historical document datasets that can correctly
segment diverse unseen pages. This paper also highlights the importance of the
annotations used during training: each existing dataset is annotated
differently. We present a unification of the annotations and show its positive
impact on the final text recognition results. In this end, we present a
complete evaluation strategy using standard pixel-level metrics, object-level
ones and introducing goal-oriented metrics.
Related papers
- Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model.
A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances.
Tests demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z) - From Text Segmentation to Smart Chaptering: A Novel Benchmark for
Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z) - The Learnable Typewriter: A Generative Approach to Text Analysis [17.355857281085164]
We present a generative document-specific approach to character analysis and recognition in text lines.
Taking as input a set of text lines with similar font or handwriting, our approach can learn a large number of different characters.
arXiv Detail & Related papers (2023-02-03T11:17:59Z) - Digital Editions as Distant Supervision for Layout Analysis of Printed
Books [76.29918490722902]
We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models.
In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics.
We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z) - Automatic Document Sketching: Generating Drafts from Analogous Texts [44.626645471195495]
We introduce a new task, document sketching, which involves generating entire draft documents for the writer to review and revise.
These drafts are built from sets of documents that overlap in form - sharing large segments of potentially reusable text - while diverging in content.
We investigate the application of weakly supervised methods, including use of a transformer-based mixture of experts, together with reinforcement learning.
arXiv Detail & Related papers (2021-06-14T06:46:06Z) - Minimally-Supervised Structure-Rich Text Categorization via Learning on
Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z) - Handwriting Classification for the Analysis of Art-Historical Documents [6.918282834668529]
We focus on the analysis of handwriting in scanned documents from the art-historic archive of the WPI.
We propose a handwriting classification model that labels extracted text fragments based on their visual structure.
arXiv Detail & Related papers (2020-11-04T13:06:46Z) - Whole page recognition of historical handwriting [1.2183405753834562]
We investigate an end-to-end inference approach without text localization which takes a handwritten page and transcribes its full text.
No explicit character, word or line segmentation is involved in inference which is why we call this approach "segmentation free"
We conclude that a whole page inference approach without text localization and segmentation is competitive.
arXiv Detail & Related papers (2020-09-22T15:46:33Z) - A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning.
This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021.
We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
arXiv Detail & Related papers (2020-08-02T00:09:03Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.