Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts
- URL: http://arxiv.org/abs/2110.11934v1
- Date: Fri, 22 Oct 2021 17:33:17 GMT
- Title: Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts
- Authors: Allen Kim, Charuta Pethe, Naoya Inoue and Steve Skiena
- Abstract summary: We consider the issue of deduplication in the presence of optical character recognition (OCR) errors.
We present methods to handle these errors, evaluated on a collection of 19,347 texts from the Project Gutenberg dataset.
We show that our method corrects over six times as many errors as it introduces.
- Score: 4.773188087436866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Substantial amounts of work are required to clean large collections of
digitized books for NLP analysis, both because of the presence of errors in the
scanned text and the presence of duplicate volumes in the corpora. In this
paper, we consider the issue of deduplication in the presence of optical
character recognition (OCR) errors. We present methods to handle these errors,
evaluated on a collection of 19,347 texts from the Project Gutenberg dataset
and 96,635 texts from the HathiTrust Library. We demonstrate that improvements
in language models now enable the detection and correction of OCR errors
without consideration of the scanned image itself. The inconsistencies found
by aligning pairs of scans of the same underlying work provide training data
to build models for detecting and correcting errors. We identify the canonical
version for each of 17,136 repeatedly-scanned books from 58,808 scans. Finally,
we investigate methods to detect and correct errors in single-copy texts. We
show that on average, our method corrects over six times as many errors as it
introduces. We also provide interesting analysis on the relation between
scanning quality and other factors such as location and publication year.
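The pairwise-alignment idea from the abstract can be sketched with off-the-shelf sequence alignment. The snippet below is an illustrative sketch, not the authors' implementation: it uses Python's difflib to surface word-level disagreements between two scans of the same work as candidate OCR errors (the function name is ours).

```python
import difflib

def find_inconsistencies(scan_a, scan_b):
    """Align two OCR scans of the same underlying work and return the word
    spans where they disagree; each disagreement marks a likely OCR error
    in at least one of the scans."""
    a, b = scan_a.split(), scan_b.split()
    matcher = difflib.SequenceMatcher(None, a, b)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

# Two scans of the same sentence, one with an OCR error:
print(find_inconsistencies("the qnick brown fox", "the quick brown fox"))
# -> [(['qnick'], ['quick'])]
```

In the paper's setting, such disagreements, aggregated over many pairs of scans, become training data for the error-detection and correction models.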
Related papers
- Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors [11.07539342949602]
We propose an end-to-end framework for detecting factual errors in text summarization.
Our framework uses a diverse set of LLM prompts to identify factual inconsistencies.
We calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent or free of hallucination.
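As a rough illustration of the aggregation step only (DEEP's actual calibration procedure is more involved; this sketch assumes simple binary verdicts per prompt and averages them):

```python
def ensemble_probability(verdicts):
    """Aggregate per-prompt binary verdicts (1 = 'factually consistent',
    0 = 'inconsistent') into a single probability estimate."""
    if not verdicts:
        raise ValueError("need at least one prompt verdict")
    return sum(verdicts) / len(verdicts)

# Five hypothetical prompt verdicts for one summary:
print(ensemble_probability([1, 1, 0, 1, 1]))  # -> 0.8
```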
arXiv Detail & Related papers (2024-06-18T18:59:37Z)
- Is it an i or an l: Test-time Adaptation of Text Line Recognition Models [9.149602257966917]
We introduce the problem of adapting text line recognition models during test time.
We propose an iterative self-training approach that uses feedback from the language model to update the optical model.
Experimental results show that the proposed adaptation method offers an absolute improvement of up to 8% in character error rate.
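The self-training loop can be sketched generically; the function below is a hypothetical illustration in which `recognize` and `lm_score` stand in for the optical model and the language-model feedback (names and threshold are ours, not from the paper):

```python
def self_training_step(lines, recognize, lm_score, threshold=0.9):
    """One test-time adaptation step: recognize each line image, and keep
    hypotheses the language model scores above `threshold` as pseudo-labels
    for updating the optical model on the next iteration."""
    pseudo_labels = []
    for line in lines:
        hypothesis = recognize(line)
        if lm_score(hypothesis) >= threshold:
            pseudo_labels.append((line, hypothesis))
    return pseudo_labels
```

Iterating this step lets the optical model adapt to the test-time distribution using only its own confident outputs.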
arXiv Detail & Related papers (2023-08-29T05:44:00Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- Transformer-Based UNet with Multi-Headed Cross-Attention Skip Connections to Eliminate Artifacts in Scanned Documents [0.0]
A modified UNet structure using a Swin Transformer backbone is presented to remove typical artifacts in scanned documents.
An improvement in text extraction quality with a reduced error rate of up to 53.9% on the synthetic data is achieved.
arXiv Detail & Related papers (2023-06-05T12:12:23Z)
- Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and so complex that humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- DocScanner: Robust Document Image Rectification with Progressive Learning [162.03694280524084]
This work presents DocScanner, a new deep network architecture for document image rectification.
DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture.
The iterative refinements make DocScanner converge to a robust and superior performance, and the lightweight recurrent architecture ensures the running efficiency.
arXiv Detail & Related papers (2021-10-28T09:15:02Z)
- Improving Translation Robustness with Visual Cues and Error Correction [58.97421756225425]
We introduce the idea of visual context to improve translation robustness against noisy texts.
We also propose a novel error correction training regime by treating error correction as an auxiliary task.
arXiv Detail & Related papers (2021-03-12T15:31:34Z)
- Neural OCR Post-Hoc Correction of Historical Corpora [4.427447378048202]
We propose a neural approach based on a combination of recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors.
We show that our models are robust in capturing diverse OCR transcription errors and reduce a word error rate of 32.3% by more than 89%.
arXiv Detail & Related papers (2021-02-01T01:35:55Z)
- OCR Post Correction for Endangered Language Texts [113.8242302688894]
We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages.
We present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting.
We develop an OCR post-correction method tailored to ease training in this data-scarce setting.
arXiv Detail & Related papers (2020-11-10T21:21:08Z)
- A Tool for Facilitating OCR Postediting in Historical Documents [6.1335228645093265]
This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents.
The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom.
arXiv Detail & Related papers (2020-04-23T21:40:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.