Related papers: An Evaluation of DNN Architectures for Page Segmentation of Historical Newspapers

URL: http://arxiv.org/abs/2004.07317v1
Date: Wed, 15 Apr 2020 20:05:54 GMT
Title: An Evaluation of DNN Architectures for Page Segmentation of Historical Newspapers
Authors: Bernhard Liebl and Manuel Burghardt
Abstract summary: We evaluate 11 different published Deep Neural Network (DNN) backbone architectures and 9 different tiling and scaling configurations for separating text, tables or table column lines. We show the influence of the number of labels and the number of training pages on the segmentation quality, which we measure using the Matthews Correlation Coefficient. Our results show that (depending on the task) Inception-ResNet-v2 and EfficientNet backbones work best and that vertical tiling is generally preferable to other tiling approaches.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: One important and particularly challenging step in the optical character recognition (OCR) of historical documents with complex layouts, such as newspapers, is the separation of text from non-text content (e.g. page borders or illustrations). This step is commonly referred to as page segmentation. While various rule-based algorithms have been proposed, the applicability of Deep Neural Networks (DNNs) for this task recently has gained a lot of attention. In this paper, we perform a systematic evaluation of 11 different published DNN backbone architectures and 9 different tiling and scaling configurations for separating text, tables or table column lines. We also show the influence of the number of labels and the number of training pages on the segmentation quality, which we measure using the Matthews Correlation Coefficient. Our results show that (depending on the task) Inception-ResNet-v2 and EfficientNet backbones work best, vertical tiling is generally preferable to other tiling approaches, and training data that comprises 30 to 40 pages will be sufficient most of the time.
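
The abstract names the Matthews Correlation Coefficient (MCC) as its quality measure but does not spell it out. As a minimal sketch, assuming a binary per-pixel labeling (1 = text, 0 = non-text), MCC can be computed from the confusion counts as (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name `binary_mcc` and the zero-denominator convention below are illustrative assumptions, not taken from the paper, which may use a multi-class variant.

```python
import math


def binary_mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for two equal-length 0/1 label sequences.

    Hypothetical helper for illustration; the paper's own evaluation code
    is not shown in this listing.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")

    # Confusion counts over all positions (e.g. flattened pixel masks).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: return 0.0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    truth = [1, 1, 0, 0]
    prediction = [1, 0, 0, 0]
    print(round(binary_mcc(truth, prediction), 4))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect agreement), and it stays informative when one class dominates, which is a common situation in page masks.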

Related papers

Few-Shot Connectivity-Aware Text Line Segmentation in Historical Documents [1.4065611645922207]
In this work, we demonstrate that small and simple architectures, coupled with a topology-aware loss function, are more accurate and data-efficient than more complex alternatives. Our methodology significantly improves upon the current state-of-the-art on the U-DIADS-TL dataset, with a 200% increase in Recognition Accuracy and a 75% increase in Line Intersection over Union.
arXiv Detail & Related papers (2025-08-26T16:11:32Z)
The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure [91.01653854955286]
Embedding-based similarity metrics can be influenced by spurious attributes like the text's source or language. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost.
arXiv Detail & Related papers (2025-07-01T23:17:12Z)
From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse. We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z)
Text Reading Order in Uncontrolled Conditions by Sparse Graph Segmentation [71.40119152422295]
We propose a lightweight, scalable and generalizable approach to identify text reading order. The model is language-agnostic and runs effectively across multi-language datasets. It is small enough to be deployed on virtually any platform including mobile devices.
arXiv Detail & Related papers (2023-05-04T06:21:00Z)
Semantic Parsing of Interpage Relations [0.0]
We formalize the task as semantic parsing of interpage relations and we propose an end-to-end approach for interpage dependency extraction. We also design a multi-task training approach to jointly optimize for page embeddings to be used in segmentation, classification, and parsing of the page dependencies. Our experimental results show that the proposed method increased LAS by 41 percentage points for semantic parsing, increased accuracy by 33 percentage points for page stream segmentation, and 45 percentage points for page classification over a naive baseline.
arXiv Detail & Related papers (2022-05-26T17:50:43Z)
Robust Text Line Detection in Historical Documents: Learning and Evaluation Methods [1.9938405188113029]
We present a study conducted using three state-of-the-art systems Doc-UFCN, dhSegment and ARU-Net. We show that it is possible to build generic models trained on a wide variety of historical document datasets that can correctly segment diverse unseen pages.
arXiv Detail & Related papers (2022-03-23T11:56:25Z)
Combining Morphological and Histogram based Text Line Segmentation in the OCR Context [0.0]
The algorithmic approach proposed in this paper has been designed for this exact purpose. The method was developed to be applied to a historic data collection that commonly features quality issues. Because of the promising segmentation results, achieved at low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg.
arXiv Detail & Related papers (2021-03-16T09:06:25Z)
Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network. Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
Topical Change Detection in Documents via Embeddings of Long Sequences [4.13878392637062]
We formulate the task of text segmentation as an independent supervised prediction task. By fine-tuning on paragraphs of similar sections, we are able to show that learned features encode topic information. Unlike previous approaches, which mostly operate on sentence-level, we consistently use a broader context.
arXiv Detail & Related papers (2020-12-07T12:09:37Z)
WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization. Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation. Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations. Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
Neural Abstractive Summarization with Structural Attention [31.50918718905953]
We present a hierarchical encoder based on structural attention to model such inter-sentence and inter-document dependencies. We show that our proposed model achieves significant improvement over the baselines in both single and multi-document summarization settings.
arXiv Detail & Related papers (2020-04-21T03:39:15Z)
Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer. In detail, the input is a set of structured records and a reference text describing another record set. The output is a summary that accurately describes the partial content in the source record set in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.