Related papers: Combining Morphological and Histogram based Text Line Segmentation in the OCR Context

URL: http://arxiv.org/abs/2103.08922v1
Date: Tue, 16 Mar 2021 09:06:25 GMT
Title: Combining Morphological and Histogram based Text Line Segmentation in the OCR Context
Authors: Pit Schneider
Abstract summary: The algorithmic approach proposed by this paper is designed for text line segmentation, one of the pre-stages of modern OCR systems. The method was developed for a historic data collection that commonly features quality issues. Because of its promising segmentation results and low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text line segmentation is one of the pre-stages of modern optical character recognition systems. The algorithmic approach proposed by this paper has been designed for this exact purpose. Its main characteristic is the combination of two different techniques, morphological image operations and horizontal histogram projections. The method was developed to be applied on a historic data collection that commonly features quality issues, such as degraded paper, blurred text, or curved text lines. For that reason, the segmenter in question could be of particular interest for cultural institutions, such as libraries, archives, museums, ..., that want access to robust line bounding boxes for a given historic document. Because of the promising segmentation results that are joined by low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg, in the context of the initiative of reprocessing their historic newspaper collection. The general contribution of this paper is to outline the approach and to evaluate the gains in terms of accuracy and speed, comparing it to the segmentation algorithm bundled with the used open source OCR software.
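
The combination named in the abstract (a morphological operation followed by a horizontal histogram projection) can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names `dilate_horizontally` and `segment_lines`, the smear width, and the ink threshold are all assumptions chosen for illustration, and the input is assumed to be an already binarized page where nonzero pixels are ink.

```python
import numpy as np

def dilate_horizontally(binary, width):
    """Morphological dilation with a flat 1 x (2*(width//2)+1) element.

    Smears ink sideways so that the characters and words of one text
    line merge into a single horizontal blob.
    """
    half = width // 2
    padded = np.pad(binary, ((0, 0), (half, half)), mode="constant")
    out = np.zeros_like(binary)
    for shift in range(2 * half + 1):
        out |= padded[:, shift:shift + binary.shape[1]]
    return out

def segment_lines(binary, smear_width=15, min_ink_fraction=0.05):
    """Return line bounding boxes as (left, top, right, bottom) tuples.

    Hypothetical helper: smear_width and min_ink_fraction are
    illustrative defaults, not values taken from the paper.
    """
    ink = np.asarray(binary).astype(bool)
    smeared = dilate_horizontally(ink, smear_width)

    # Horizontal histogram projection: count ink pixels in every row.
    profile = smeared.sum(axis=1)
    is_text = profile > min_ink_fraction * ink.shape[1]

    # Locate runs of consecutive text rows; each run is one line band.
    flags = np.concatenate(([False], is_text, [False]))
    changes = np.flatnonzero(flags[1:] != flags[:-1])
    tops, bottoms = changes[0::2], changes[1::2]

    boxes = []
    for top, bottom in zip(tops, bottoms):
        # Tighten the box horizontally using the unsmeared pixels.
        cols = np.flatnonzero(ink[top:bottom].any(axis=0))
        boxes.append((int(cols[0]), int(top), int(cols[-1]) + 1, int(bottom)))
    return boxes
```

A plain projection like this assumes roughly horizontal lines; the curved or degraded lines mentioned in the abstract would need additional handling that this sketch does not attempt.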

Related papers

Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech [61.00008468914252]
We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. Benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost.
arXiv Detail & Related papers (2025-12-30T23:29:51Z)
A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models [71.66119575697458]
Parallel text generation techniques aim to break the token-by-token generation bottleneck and improve inference efficiency. We categorize existing approaches into AR-based and Non-AR-based paradigms, and provide a detailed examination of the core techniques within each category. We highlight recent advancements, identify open challenges, and outline promising directions for future research in parallel text generation.
arXiv Detail & Related papers (2025-08-12T07:56:04Z)
TextBite: A Historical Czech Document Dataset for Logical Page Segmentation [0.0]
Previous approaches have relied on OCR or precise geometry to define logical segmentation. To avoid the need for OCR, we define the task purely as segmentation in the image domain. We introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text.
arXiv Detail & Related papers (2025-03-20T19:19:12Z)
SegHist: A General Segmentation-based Framework for Chinese Historical Document Text Line Detection [10.08588082910962]
Text line detection is a key task in historical document analysis. We propose a general framework for historical document text detection (SegHist). Integrating the SegHist framework with the commonly used method DB++, we develop DB-SegHist.
arXiv Detail & Related papers (2024-06-17T11:00:04Z)
The CLRS-Text Algorithmic Reasoning Language Benchmark [48.45201665463275]
CLRS-Text is a textual version of the CLRS benchmark. CLRS-Text is capable of procedurally generating trace data for thirty diverse, challenging algorithmic tasks. We fine-tune and evaluate various LMs as generalist executors on this benchmark.
arXiv Detail & Related papers (2024-06-06T16:29:25Z)
From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse. We also introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z)
Segmenting Messy Text: Detecting Boundaries in Text Derived from Historical Newspaper Images [0.0]
We consider a challenging text segmentation task: dividing newspaper marriage announcement lists into units of one announcement each. In many cases the information is not structured into sentences, and adjacent segments are not topically distinct from each other. We present a novel deep learning-based model for segmenting such text and show that it significantly outperforms an existing state-of-the-art method on our task.
arXiv Detail & Related papers (2023-12-20T05:17:06Z)
Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents [0.8158530638728501]
This paper evaluates the impact of image processing methods and parameter tuning in Optical Character Recognition (OCR). The approach uses a multi-objective problem formulation to minimize Levenshtein edit distance and maximize the number of words correctly identified with a non-dominated sorting genetic algorithm (NSGA-II). Our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results.
arXiv Detail & Related papers (2023-11-27T11:44:46Z)
CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [56.58365347854647]
We introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP. Our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders.
arXiv Detail & Related papers (2023-03-21T12:28:21Z)
One-shot Compositional Data Generation for Low Resource Handwritten Text Recognition [10.473427493876422]
Low resource Handwritten Text Recognition is a hard problem due to the scarce annotated data and the very limited linguistic information. In this paper we address this problem through a data generation technique based on Bayesian Program Learning. Contrary to traditional generation approaches, which require a huge amount of annotated images, our method is able to generate human-like handwriting using only one sample of each symbol from the desired alphabet.
arXiv Detail & Related papers (2021-05-11T18:53:01Z)
Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs). We compare their accuracy and performance on widely used public datasets of scene and handwritten text. Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
SCATTER: Selective Context Attentional Scene Text Recognizer [16.311256552979835]
Scene Text Recognition (STR) is the task of recognizing text against complex image backgrounds. Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes. We introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER).
arXiv Detail & Related papers (2020-03-25T09:20:28Z)
Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer. In detail, the input is a set of structured records and a reference text for describing another recordset. The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
TextScanner: Reading Characters in Order for Robust Scene Text Recognition [60.04267660533966]
TextScanner is an alternative approach for scene text recognition. It generates pixel-wise, multi-channel segmentation maps for character class, position and order. It also adopts RNN for context modeling and performs paralleled prediction for character position and class.
arXiv Detail & Related papers (2019-12-28T07:52:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.