An Evaluation of DNN Architectures for Page Segmentation of Historical
Newspapers
- URL: http://arxiv.org/abs/2004.07317v1
- Date: Wed, 15 Apr 2020 20:05:54 GMT
- Title: An Evaluation of DNN Architectures for Page Segmentation of Historical
Newspapers
- Authors: Bernhard Liebl and Manuel Burghardt
- Abstract summary: We evaluate 11 different published Deep Neural Network (DNN) backbone architectures and 9 different tiling and scaling configurations for separating text, tables or table column lines.
We show the influence of the number of labels and the number of training pages on the segmentation quality, which we measure using the Matthews Correlation Coefficient.
Our results show that (depending on the task) Inception-ResNet-v2 and EfficientNet backbones work best, and that vertical tiling is generally preferable to other tiling approaches.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One important and particularly challenging step in the optical character
recognition (OCR) of historical documents with complex layouts, such as
newspapers, is the separation of text from non-text content (e.g. page borders
or illustrations). This step is commonly referred to as page segmentation.
While various rule-based algorithms have been proposed, the applicability of
Deep Neural Networks (DNNs) for this task has recently gained a lot of
attention. In this paper, we perform a systematic evaluation of 11 different
published DNN backbone architectures and 9 different tiling and scaling
configurations for separating text, tables or table column lines. We also show
the influence of the number of labels and the number of training pages on the
segmentation quality, which we measure using the Matthews Correlation
Coefficient. Our results show that (depending on the task) Inception-ResNet-v2
and EfficientNet backbones work best, vertical tiling is generally preferable
to other tiling approaches, and training data that comprises 30 to 40 pages
will be sufficient most of the time.
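
The abstract names the Matthews Correlation Coefficient (MCC) as its quality measure but does not spell it out. As a minimal sketch, assuming a binary text/non-text mask stored as nested lists of 0 and 1, the standard binary MCC can be computed as shown here. The function names `confusion_counts` and `mcc` are hypothetical, and the paper's own evaluation (which covers several label sets) may use a multi-class variant instead.

```python
import math


def confusion_counts(truth, pred):
    """Count (TP, TN, FP, FN) over two equally sized binary masks.

    Both masks are assumed to be nested lists of 0/1 values,
    where 1 marks a pixel of the target class (e.g. text).
    """
    tp = tn = fp = fn = 0
    for row_t, row_p in zip(truth, pred):
        for t, p in zip(row_t, row_p):
            if t and p:
                tp += 1
            elif not t and not p:
                tn += 1
            elif p:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Standard binary Matthews Correlation Coefficient in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the
        # confusion matrix is empty and the ratio is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy 2x4 masks: one missed pixel and one false alarm.
truth = [[1, 1, 0, 0],
         [1, 1, 0, 0]]
pred = [[1, 1, 0, 0],
        [1, 0, 0, 1]]

score = mcc(*confusion_counts(truth, pred))
print(score)  # 0.5
```

Unlike plain pixel accuracy, MCC uses all four confusion-matrix cells, so a segmenter that labels everything as the majority class scores 0 rather than appearing accurate on pages dominated by one class.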
Related papers
- From Text Segmentation to Smart Chaptering: A Novel Benchmark for
Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model, MiniSeg, which outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z)
- Text Reading Order in Uncontrolled Conditions by Sparse Graph
Segmentation [71.40119152422295]
We propose a lightweight, scalable and generalizable approach to identify text reading order.
The model is language-agnostic and runs effectively across multi-language datasets.
It is small enough to be deployed on virtually any platform including mobile devices.
arXiv Detail & Related papers (2023-05-04T06:21:00Z)
- Semantic Parsing of Interpage Relations [0.0]
We formalize the task as semantic parsing of interpage relations and we propose an end-to-end approach for interpage dependency extraction.
We also design a multi-task training approach to jointly optimize for page embeddings to be used in segmentation, classification, and parsing of the page dependencies.
Our experimental results show that the proposed method increased LAS by 41 percentage points for semantic parsing, increased accuracy by 33 percentage points for page stream segmentation, and 45 percentage points for page classification over a naive baseline.
arXiv Detail & Related papers (2022-05-26T17:50:43Z)
- Robust Text Line Detection in Historical Documents: Learning and
Evaluation Methods [1.9938405188113029]
We present a study conducted using three state-of-the-art systems: Doc-UFCN, dhSegment and ARU-Net.
We show that it is possible to build generic models trained on a wide variety of historical document datasets that can correctly segment diverse unseen pages.
arXiv Detail & Related papers (2022-03-23T11:56:25Z)
- Combining Morphological and Histogram based Text Line Segmentation in
the OCR Context [0.0]
The algorithmic approach proposed in this paper has been designed for this exact purpose.
The method was developed to be applied on a historic data collection that commonly features quality issues.
Because of the promising segmentation results that are joined by low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg.
arXiv Detail & Related papers (2021-03-16T09:06:25Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on
Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- Topical Change Detection in Documents via Embeddings of Long Sequences [4.13878392637062]
We formulate the task of text segmentation as an independent supervised prediction task.
By fine-tuning on paragraphs of similar sections, we are able to show that learned features encode topic information.
Unlike previous approaches, which mostly operate on sentence-level, we consistently use a broader context.
arXiv Detail & Related papers (2020-12-07T12:09:37Z)
- WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
- A Graph Representation of Semi-structured Data for Web Question
Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
- Neural Abstractive Summarization with Structural Attention [31.50918718905953]
We present a hierarchical encoder based on structural attention to model such inter-sentence and inter-document dependencies.
We show that our proposed model achieves significant improvement over the baselines in both single and multi-document summarization settings.
arXiv Detail & Related papers (2020-04-21T03:39:15Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)