Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts
- URL: http://arxiv.org/abs/2110.11934v1
- Date: Fri, 22 Oct 2021 17:33:17 GMT
- Title: Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts
- Authors: Allen Kim, Charuta Pethe, Naoya Inoue and Steve Skiena
- Abstract summary: We consider the issue of deduplication in the presence of optical character recognition (OCR) errors.
We present methods to handle these errors, evaluated on a collection of 19,347 texts from the Project Gutenberg dataset.
We show that our method corrects over six times as many errors as it introduces.
- Score: 4.773188087436866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Substantial amounts of work are required to clean large collections of
digitized books for NLP analysis, both because of the presence of errors in the
scanned text and the presence of duplicate volumes in the corpora. In this
paper, we consider the issue of deduplication in the presence of optical
character recognition (OCR) errors. We present methods to handle these errors,
evaluated on a collection of 19,347 texts from the Project Gutenberg dataset
and 96,635 texts from the HathiTrust Library. We demonstrate that improvements
in language models now enable the detection and correction of OCR errors
without consideration of the scanning image itself. The inconsistencies found
by aligning pairs of scans of the same underlying work provide training data
to build models for detecting and correcting errors. We identify the canonical
version for each of 17,136 repeatedly-scanned books from 58,808 scans. Finally,
we investigate methods to detect and correct errors in single-copy texts. We
show that on average, our method corrects over six times as many errors as it
introduces. We also analyze the relation between scanning quality and other
factors such as location and publication year.
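The abstract does not spell out the alignment procedure, so the following is only an illustrative sketch under assumed details: it uses Python's standard-library difflib to compare two OCR scans of the same work at the token level, flag mismatched spans as candidate OCR inconsistencies (potential training pairs for an error-correction model), and compute a similarity ratio that could flag likely duplicate scans. The function names and example strings are hypothetical and not taken from the paper.

```python
import difflib

def extract_inconsistencies(scan_a: str, scan_b: str):
    """Align two OCR scans of the same work at the token level and return
    mismatched spans as (text_a, text_b) pairs. Such pairs could serve as
    candidate training examples for an error-detection/correction model.
    (Hypothetical helper; not the paper's actual pipeline.)"""
    tokens_a = scan_a.split()
    tokens_b = scan_b.split()
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only the spans where the two scans disagree
            pairs.append((" ".join(tokens_a[i1:i2]), " ".join(tokens_b[j1:j2])))
    return pairs

def similarity(scan_a: str, scan_b: str) -> float:
    """Token-level similarity ratio; a ratio close to 1.0 despite OCR noise
    suggests the two scans are duplicates of the same underlying work."""
    return difflib.SequenceMatcher(
        a=scan_a.split(), b=scan_b.split(), autojunk=False
    ).ratio()

if __name__ == "__main__":
    # Toy example with typical OCR confusions (rn -> m, i -> u).
    a = "It was the best of times , it was the worst of tirnes"
    b = "It was the best of tunes , it was the worst of times"
    print(similarity(a, b))               # high ratio -> likely duplicates
    print(extract_inconsistencies(a, b))  # [('times', 'tunes'), ('tirnes', 'times')]
```

At the scale reported in the paper (tens of thousands of full-length books), a pipeline along these lines would need chunked or anchored alignment rather than a single SequenceMatcher call over entire volumes.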
Related papers
- TGEA: An Error-Annotated Dataset and Benchmark Tasks for Text Generation from Pretrained Language Models [57.758735361535486]
TGEA is an error-annotated dataset for text generation from pretrained language models (PLMs).
We create an error taxonomy to cover 24 types of errors occurring in PLM-generated sentences.
This is the first dataset with comprehensive annotations for PLM-generated texts.
arXiv Detail & Related papers (2025-03-06T09:14:02Z)
- Reference-Based Post-OCR Processing with LLM for Diacritic Languages [0.0]
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text.
This technique generates high-precision pseudo-page-to-page labels for diacritic languages.
The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences.
arXiv Detail & Related papers (2024-10-17T08:05:02Z)
- Full-text Error Correction for Chinese Speech Recognition with Large Language Model [11.287933170894311]
Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR).
This paper investigates the effectiveness of LLMs for error correction in full-text generated by ASR systems from longer speech recordings.
arXiv Detail & Related papers (2024-09-12T06:50:45Z)
- A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z)
- Is it an i or an l: Test-time Adaptation of Text Line Recognition Models [9.149602257966917]
We introduce the problem of adapting text line recognition models during test time.
We propose an iterative self-training approach that uses feedback from the language model to update the optical model.
Experimental results show that the proposed adaptation method offers an absolute improvement of up to 8% in character error rate.
arXiv Detail & Related papers (2023-08-29T05:44:00Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by the recent efforts in several NLP tasks for fine-grained evaluation, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between the sentence pairs on the held-out datasets from 7 NLP tasks align well with the human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- DocScanner: Robust Document Image Rectification with Progressive Learning [162.03694280524084]
This work presents DocScanner, a new deep network architecture for document image rectification.
DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture.
The iterative refinements make DocScanner converge to a robust and superior performance, and the lightweight recurrent architecture ensures the running efficiency.
arXiv Detail & Related papers (2021-10-28T09:15:02Z)
- Improving Translation Robustness with Visual Cues and Error Correction [58.97421756225425]
We introduce the idea of visual context to improve translation robustness against noisy texts.
We also propose a novel error correction training regime by treating error correction as an auxiliary task.
arXiv Detail & Related papers (2021-03-12T15:31:34Z)
- Neural OCR Post-Hoc Correction of Historical Corpora [4.427447378048202]
We propose a neural approach combining recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors.
We show that our models are robust in capturing diverse OCR transcription errors and reduce the 32.3% word error rate by more than 89%.
arXiv Detail & Related papers (2021-02-01T01:35:55Z)
- OCR Post Correction for Endangered Language Texts [113.8242302688894]
We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages.
We present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting.
We develop an OCR post-correction method tailored to ease training in this data-scarce setting.
arXiv Detail & Related papers (2020-11-10T21:21:08Z)
- A Tool for Facilitating OCR Postediting in Historical Documents [6.1335228645093265]
This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents.
The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom.
arXiv Detail & Related papers (2020-04-23T21:40:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.