A Tool for Facilitating OCR Postediting in Historical Documents
- URL: http://arxiv.org/abs/2004.11471v1
- Date: Thu, 23 Apr 2020 21:40:30 GMT
- Title: A Tool for Facilitating OCR Postediting in Historical Documents
- Authors: Alberto Poncelas, Mohammad Aboomar, Jan Buts, James Hadley, Andy Way
- Abstract summary: This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents.
The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom.
- Score: 6.1335228645093265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optical character recognition (OCR) for historical documents is a complex
procedure subject to a unique set of material issues, including inconsistencies
in typefaces and low quality scanning. Consequently, even the most
sophisticated OCR engines produce errors. This paper reports on a tool built
for postediting the output of Tesseract, more specifically for correcting
common errors in digitized historical documents. The proposed tool suggests
alternatives for word forms not found in a specified vocabulary. The assumed
error is replaced by a presumably correct alternative in the post-edition based
on the scores of a Language Model (LM). The tool is tested on a chapter of the
book An Essay Towards Regulating the Trade and Employing the Poor of this
Kingdom (Cary, 1719). As demonstrated below, the tool is successful in
correcting a number of common errors. If sometimes unreliable, it is also
transparent and subject to human intervention.
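The correction loop the abstract describes — flag word forms missing from a vocabulary, generate candidate replacements, and keep the one a language model scores highest in context — can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the `CONFUSIONS` table, the `candidates`/`postedit` names, and the `lm_score` callback are all assumptions.

```python
# Common OCR confusions in early-modern print (e.g. the long s misread as 'f').
# This table is a hypothetical example, not the tool's actual rule set.
CONFUSIONS = {"f": ["s"], "rn": ["m"], "1": ["l"], "0": ["o"]}

def candidates(word):
    """Return spelling variants obtained by undoing one OCR confusion."""
    out = {word}
    for wrong, rights in CONFUSIONS.items():
        for i in range(len(word)):
            if word.startswith(wrong, i):
                for right in rights:
                    out.add(word[:i] + right + word[i + len(wrong):])
    return out

def postedit(tokens, vocab, lm_score):
    """Replace out-of-vocabulary tokens with the best in-vocabulary candidate.

    lm_score takes a full token sequence and returns a language-model score;
    tokens with no in-vocabulary candidate are left untouched for human review.
    """
    corrected = []
    for i, tok in enumerate(tokens):
        if tok.lower() in vocab:
            corrected.append(tok)
            continue
        options = [c for c in candidates(tok) if c.lower() in vocab]
        if not options:
            corrected.append(tok)  # unknown word: keep it, flag for a human
            continue
        # Score each candidate in its sentence context; keep the best one.
        best = max(options,
                   key=lambda c: lm_score(tokens[:i] + [c] + tokens[i + 1:]))
        corrected.append(best)
    return corrected
```

For example, with a vocabulary containing "this", `postedit("the trade of thif kingdom".split(), vocab, lm_score)` would repair the long-s misreading "thif" to "this". Keeping the pipeline this simple is what makes the tool transparent: every substitution is traceable to a confusion rule and an LM score, and can be overridden by a human posteditor.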
Related papers
- FactCheck Editor: Multilingual Text Editor with End-to-End fact-checking [1.985242455423935]
  'FactCheck Editor' is an advanced text editor designed to automate fact-checking and correct factual inaccuracies.
  It supports over 90 languages and utilizes transformer models to assist humans in the labor-intensive process of fact verification.
  arXiv Detail & Related papers (2024-04-30T11:55:20Z)
- GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence [64.95492752484171]
  We present GenAudit -- a tool intended to assist fact-checking of LLM responses for document-grounded tasks.
  We train models to execute these tasks and design an interactive interface to present suggested edits and evidence to users.
  To ensure that most errors are flagged by the system, we propose a method that increases error recall while minimizing the impact on precision.
  arXiv Detail & Related papers (2024-02-19T21:45:55Z)
- FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
  A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models.
  We propose FacTool, a task- and domain-agnostic framework for detecting factual errors in texts generated by large language models.
  arXiv Detail & Related papers (2023-07-25T14:20:51Z)
- Transformer-Based UNet with Multi-Headed Cross-Attention Skip Connections to Eliminate Artifacts in Scanned Documents [0.0]
  A modified UNet structure using a Swin Transformer backbone is presented to remove typical artifacts in scanned documents.
  An improvement in text extraction quality, with an error rate reduced by up to 53.9% on synthetic data, is achieved.
  arXiv Detail & Related papers (2023-06-05T12:12:23Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
  We show that utilizing OCR reduces the time spent on manual transcription of culturally valuable documents by over 50%.
  Our results demonstrate the potential benefits that OCR tools can have for downstream language documentation and revitalization efforts.
  arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Factual Error Correction for Abstractive Summaries Using Entity Retrieval [57.01193722520597]
  We propose RFEC, an efficient factual error correction system based on an entity-retrieval post-editing process.
  RFEC retrieves evidence sentences from the original document by comparing them with the target summary.
  Next, RFEC detects entity-level errors in the summaries using the evidence sentences and substitutes the wrong entities with accurate entities from those sentences.
  arXiv Detail & Related papers (2022-04-18T11:35:02Z)
- DocScanner: Robust Document Image Rectification with Progressive Learning [162.03694280524084]
  This work presents DocScanner, a new deep network architecture for document image rectification.
  DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture.
  The iterative refinements allow DocScanner to converge to robust, superior performance, and the lightweight recurrent architecture ensures running efficiency.
  arXiv Detail & Related papers (2021-10-28T09:15:02Z)
- Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts [4.773188087436866]
  We consider the issue of deduplication in the presence of optical character recognition (OCR) errors.
  We present methods to handle these errors, evaluated on a collection of 19,347 texts from the Project Gutenberg dataset.
  We show that our method corrects over six times as many errors as it introduces.
  arXiv Detail & Related papers (2021-10-22T17:33:17Z)
- Neural OCR Post-Hoc Correction of Historical Corpora [4.427447378048202]
  We propose a neural approach based on a combination of recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors.
  We show that our models are robust in capturing diverse OCR transcription errors and reduce a word error rate of 32.3% by more than 89%.
  arXiv Detail & Related papers (2021-02-01T01:35:55Z)
- An Unsupervised Method for OCR Post-Correction and Spelling Normalisation for Finnish [1.0957528713294875]
  Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods.
  We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model.
  Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation.
  arXiv Detail & Related papers (2020-11-06T18:19:48Z)
- Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning [62.34197797857823]
  A central problem in the automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds.
  This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly.
  Our method has accuracy comparable to the state of the art, with a speed-up of about 22 times on a test instance with 505 shreds.
  arXiv Detail & Related papers (2020-03-23T03:22:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.