A Tool for Facilitating OCR Postediting in Historical Documents
- URL: http://arxiv.org/abs/2004.11471v1
- Date: Thu, 23 Apr 2020 21:40:30 GMT
- Title: A Tool for Facilitating OCR Postediting in Historical Documents
- Authors: Alberto Poncelas, Mohammad Aboomar, Jan Buts, James Hadley, Andy Way
- Abstract summary: This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents.
The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optical character recognition (OCR) for historical documents is a complex
procedure subject to a unique set of material issues, including inconsistencies
in typefaces and low-quality scanning. Consequently, even the most
sophisticated OCR engines produce errors. This paper reports on a tool built
for postediting the output of Tesseract, more specifically for correcting
common errors in digitized historical documents. The proposed tool suggests
alternatives for word forms not found in a specified vocabulary. The assumed
error is replaced by a presumably correct alternative in the post-edition based
on the scores of a Language Model (LM). The tool is tested on a chapter of the
book An Essay Towards Regulating the Trade and Employing the Poor of this
Kingdom (Cary, 1719). As demonstrated below, the tool is successful in
correcting a number of common errors. Though sometimes unreliable, it is also
transparent and subject to human intervention.
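The workflow the abstract describes — flag word forms missing from a vocabulary, generate alternatives, and let an LM choose the replacement — can be sketched in a few lines. The toy bigram LM, the edit-distance-1 candidate generator, and all function names below are illustrative assumptions, not the tool's actual implementation:

```python
# Minimal sketch of the postediting loop the abstract describes: flag word
# forms missing from a vocabulary, generate spelling alternatives, and keep
# the candidate the language model prefers in context. The toy bigram LM and
# the edit-distance-1 generator are illustrative, not the tool's internals.
from collections import Counter
import string

def build_bigram_lm(tokens):
    """Toy add-one-smoothed bigram LM; returns a score(prev, word) function."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams) + 1
    def score(prev, word):
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return score

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one delete/transpose/replace/insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def postedit(tokens, vocabulary, lm_score):
    """Replace each out-of-vocabulary token by the in-vocabulary candidate
    that the LM scores highest given the preceding (already corrected) token."""
    corrected = []
    for i, tok in enumerate(tokens):
        if tok not in vocabulary:
            candidates = [c for c in edits1(tok) if c in vocabulary]
            if candidates:  # otherwise keep the token for a human to inspect
                prev = corrected[i - 1] if i > 0 else "<s>"
                tok = max(candidates, key=lambda c: lm_score(prev, c))
        corrected.append(tok)
    return corrected

corpus = "the trade of this kingdom and the poor of this kingdom".split()
lm = build_bigram_lm(corpus)
print(postedit("the poor of this kingdon".split(), set(corpus), lm))
# ['the', 'poor', 'of', 'this', 'kingdom']
```

Because every replacement is drawn from an inspectable candidate list, a sketch like this stays transparent and open to human intervention, which is the property the abstract emphasizes.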
Related papers
- Reference-Based Post-OCR Processing with LLM for Diacritic Languages
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text.
This technique generates high-precision pseudo-page-to-page labels for diacritic languages.
The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences (a toy alignment sketch follows this entry).
arXiv Detail & Related papers (2024-10-17T08:05:02Z)
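The reference-based idea above — aligning imperfect OCR output against a trusted ebook to harvest corrections as pseudo labels — can be illustrated with standard sequence alignment. This difflib sketch is an assumption for illustration, not the paper's pipeline:

```python
# Rough sketch of the reference-alignment idea with difflib: align OCR output
# against a trusted ebook text and read corrections (pseudo labels) off the
# spans where the two texts disagree. An illustration, not the paper's method.
import difflib

def corrections_from_reference(ocr_tokens, reference_tokens):
    """Return (ocr_span, reference_span) pairs where the texts disagree."""
    matcher = difflib.SequenceMatcher(a=ocr_tokens, b=reference_tokens)
    return [
        (ocr_tokens[i1:i2], reference_tokens[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"  # 'replace', 'delete', 'insert' all signal OCR noise
    ]

ocr = "the quick brovvn fox jumps ovcr the lazy dog".split()
ref = "the quick brown fox jumps over the lazy dog".split()
print(corrections_from_reference(ocr, ref))
# [(['brovvn'], ['brown']), (['ovcr'], ['over'])]
```

The 'delete' and 'insert' opcodes cover the missing-word and extra-noise cases the summary mentions, while 'replace' spans yield the character-level corrections.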
- A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z)
- FactCheck Editor: Multilingual Text Editor with End-to-End fact-checking
'FactCheck Editor' is an advanced text editor designed to automate fact-checking and correct factual inaccuracies.
It supports over 90 languages and utilizes transformer models to assist humans in the labor-intensive process of fact verification.
arXiv Detail & Related papers (2024-04-30T11:55:20Z)
- GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence
We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks.
We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users.
To ensure that most errors are flagged by the system, we propose a method that can increase the error recall while minimizing the impact on precision (a threshold-sweep sketch follows this entry).
arXiv Detail & Related papers (2024-02-19T21:45:55Z)
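One common way to realize the recall/precision trade-off mentioned in the GenAudit summary is to tune the detector's decision threshold on held-out data; the sweep below is an illustrative assumption, not the paper's actual method:

```python
# Hedged sketch of raising error recall while limiting the precision cost:
# sweep the detector's decision threshold on held-out data and pick the
# threshold that maximizes recall subject to a precision floor.
def pick_threshold(scores, labels, precision_floor=0.8):
    """scores: per-span error probabilities; labels: 1 = real error, 0 = not."""
    best = None  # (threshold, recall, precision)
    for t in sorted(set(scores)):
        flagged = [s >= t for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y)
        fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
        fn = sum(1 for f, y in zip(flagged, labels) if not f and y)
        if tp == 0:
            continue
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        if precision >= precision_floor and (best is None or recall > best[1]):
            best = (t, recall, precision)
    return best

print(pick_threshold([0.9, 0.7, 0.4, 0.2], [1, 1, 1, 0]))  # (0.4, 1.0, 1.0)
```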
- FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models.
We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z)
- Transformer-Based UNet with Multi-Headed Cross-Attention Skip Connections to Eliminate Artifacts in Scanned Documents
A modified UNet structure using a Swin Transformer backbone is presented to remove typical artifacts in scanned documents.
An improvement in text extraction quality with a reduced error rate of up to 53.9% on the synthetic data is achieved.
arXiv Detail & Related papers (2023-06-05T12:12:23Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- DocScanner: Robust Document Image Rectification with Progressive Learning
This work presents DocScanner, a new deep network architecture for document image rectification.
DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture.
The iterative refinements let DocScanner converge to robust and superior performance, and the lightweight recurrent architecture keeps it efficient at run time (a toy refinement loop follows this entry).
arXiv Detail & Related papers (2021-10-28T09:15:02Z)
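The progressive-correction idea in the DocScanner summary — one running estimate, repeatedly improved by a recurrent module — reduces to a residual refinement loop. The toy update below is a stand-in assumption, not DocScanner's network:

```python
# Toy residual-refinement loop (an assumption for illustration, not
# DocScanner's actual network): a single running estimate is progressively
# corrected, so each recurrent step refines the previous one.
import numpy as np

target = np.array([0.0, 1.0, 2.0])  # stand-in for the true rectified image
estimate = np.zeros(3)              # the single estimate that is maintained

def correction_step(est):
    """Stand-in for the recurrent correction module: predict a residual."""
    return 0.3 * (target - est)     # a real model would not see `target`

for step in range(8):
    estimate = estimate + correction_step(estimate)  # progressive correction
    print(step, float(np.abs(target - estimate).sum()))  # error shrinks per step
```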
- Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts
We consider the issue of deduplication in the presence of optical character recognition (OCR) errors.
We present methods to handle these errors, evaluated on a collection of 19,347 texts from the Project Gutenberg dataset.
We show that our method corrects over six times as many errors as it introduces (a toy fuzzy-matching illustration follows this entry).
arXiv Detail & Related papers (2021-10-22T17:33:17Z)
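Deduplication under OCR noise, as in the entry above, needs similarity rather than exact matching, since errors like "rn" for "m" break exact comparisons. A toy illustration, not the paper's method:

```python
# Illustrative sketch of deduplicating noisy OCR texts: near-duplicates are
# detected with a character-similarity ratio instead of exact matching, so
# OCR errors do not hide duplicates. Not the paper's actual method.
import difflib

def near_duplicates(texts, threshold=0.9):
    """Return (i, j, ratio) for text pairs whose similarity exceeds `threshold`."""
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            ratio = difflib.SequenceMatcher(a=texts[i], b=texts[j]).ratio()
            if ratio >= threshold:
                pairs.append((i, j, round(ratio, 3)))
    return pairs

books = [
    "it was the best of times it was the worst of times",
    "it was the best of tirnes it was the w0rst of times",  # OCR-noisy copy
    "call me ishmael some years ago never mind how long",
]
print(near_duplicates(books))  # flags the pair (0, 1) despite the OCR noise
```

Pairwise comparison is quadratic; at the scale of 19,347 texts a real system would first block or hash candidate pairs, but the noise-tolerant matching principle is the same.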
- Neural OCR Post-Hoc Correction of Historical Corpora
We propose a neural approach based on a combination of recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors.
We show that our models are robust in capturing diverse OCR transcription errors and reduce a word error rate of 32.3% by more than 89%.
arXiv Detail & Related papers (2021-02-01T01:35:55Z)
- Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning
A central problem in automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds.
This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly.
Our method has accuracy comparable to the state-of-the-art with a speed-up of about 22 times for a test instance with 505 shreds (the linear-scaling idea is sketched below).
arXiv Detail & Related papers (2020-03-23T03:22:06Z)
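The linear scaling claimed above comes from metric learning: each shred edge is embedded once by the network, and any pair's compatibility is then a cheap vector distance, so N shreds cost N inferences instead of N^2. A sketch with a random stand-in encoder (the real one is learned with self-supervision):

```python
# Sketch of why metric learning makes compatibility evaluation scale linearly:
# each shred edge is embedded once (N network inferences), and the N^2 pairwise
# compatibilities then come from cheap vector distances rather than N^2 network
# passes. The random `embed_edge` below is a stand-in for the learned encoder.
import numpy as np

rng = np.random.default_rng(0)

def embed_edge(shred_edge):
    """Stand-in for a learned encoder mapping edge pixels to a vector."""
    return rng.standard_normal(16)  # one "inference" per edge

shreds = range(505)                                # as in the paper's test instance
right = np.stack([embed_edge(s) for s in shreds])  # 505 inferences
left = np.stack([embed_edge(s) for s in shreds])   # 505 inferences

# All 505 x 505 compatibilities via plain linear algebra, no further inferences;
# a smaller distance means a more compatible pair of edges.
dists = np.linalg.norm(right[:, None, :] - left[None, :, :], axis=-1)
print(dists.shape, dists.argmin(axis=1)[:5])  # best left-edge match per right edge
```

The separate right-edge and left-edge encoders presumably mirror the "asymmetric" part of the method, since the two sides of a cut are embedded by different mappings.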
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.