Neural OCR Post-Hoc Correction of Historical Corpora
- URL: http://arxiv.org/abs/2102.00583v1
- Date: Mon, 1 Feb 2021 01:35:55 GMT
- Title: Neural OCR Post-Hoc Correction of Historical Corpora
- Authors: Lijun Lyu, Maria Koutraki, Martin Krickl, Besnik Fetahu
- Abstract summary: We propose a neural approach based on a combination of recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors.
We show that our models are robust in capturing diverse OCR transcription errors and reduce a word error rate of 32.3% by more than 89%.
- Score: 4.427447378048202
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Optical character recognition (OCR) is crucial for deeper access to
historical collections. OCR needs to account for orthographic variations,
typefaces, and language evolution (e.g., new letters, word spellings) as the
main sources of character, word, or word-segmentation transcription errors. For
digital corpora of historical prints, the errors are further exacerbated by
low scan quality and the lack of language standardization.
For the task of OCR post-hoc correction, we propose a neural approach based
on a combination of recurrent (RNN) and deep convolutional (ConvNet) networks to
correct OCR transcription errors. At the character level, we flexibly capture
errors and decode the corrected output with a novel attention mechanism.
Accounting for the similarity between input and output, we propose a new loss
function that rewards the model's correcting behavior.
Evaluation on a corpus of historical German books shows that our models are
robust in capturing diverse OCR transcription errors and reduce a word error
rate of 32.3% by more than 89%.
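The abstract describes a character-level sequence-to-sequence corrector. The sketch below (PyTorch) shows one way such a model could be assembled: a ConvNet-plus-GRU encoder, an attention-based decoder step, and a loss that down-weights characters the OCR output already gets right. All layer choices, sizes, the attention form, and the weighting scheme are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a character-level OCR corrector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharEncoder(nn.Module):
    def __init__(self, vocab_size, emb=64, conv_channels=128, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb, padding_idx=0)
        # Convolutional stack captures local OCR confusion patterns.
        self.convs = nn.Sequential(
            nn.Conv1d(emb, conv_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(conv_channels, conv_channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Recurrent layer models longer-range context across the sequence.
        self.rnn = nn.GRU(conv_channels, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, seq) character ids
        h = self.embed(x).transpose(1, 2)      # (batch, emb, seq)
        h = self.convs(h).transpose(1, 2)      # (batch, seq, conv_channels)
        out, _ = self.rnn(h)                   # (batch, seq, 2*hidden)
        return out

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, enc_dim=256, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb, padding_idx=0)
        self.rnn = nn.GRUCell(emb + enc_dim, hidden)
        self.attn = nn.Linear(hidden, enc_dim)
        self.out = nn.Linear(hidden + enc_dim, vocab_size)

    def step(self, y_prev, state, enc_out):
        # Dot-product attention over encoder states (an assumption; the paper
        # describes its own attention mechanism).
        query = self.attn(state).unsqueeze(1)            # (batch, 1, enc_dim)
        scores = (enc_out * query).sum(-1)               # (batch, seq)
        ctx = (F.softmax(scores, -1).unsqueeze(-1) * enc_out).sum(1)
        state = self.rnn(torch.cat([self.embed(y_prev), ctx], -1), state)
        return self.out(torch.cat([state, ctx], -1)), state

def similarity_aware_loss(logits, target, noisy_input, pad_id=0, keep_weight=0.5):
    """Cross-entropy that rewards correcting behavior: positions where the OCR
    input already equals the target contribute less, so training focuses on the
    characters that actually need to change (the weighting is illustrative and
    assumes the noisy input is aligned to the target)."""
    ce = F.cross_entropy(logits.transpose(1, 2), target, ignore_index=pad_id,
                         reduction="none")               # (batch, seq)
    weights = torch.where(noisy_input == target,
                          torch.full_like(ce, keep_weight),
                          torch.ones_like(ce))
    mask = (target != pad_id).float()
    return (ce * weights * mask).sum() / mask.sum().clamp(min=1.0)
```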
Related papers
- Reference-Based Post-OCR Processing with LLM for Diacritic Languages [0.0]
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text.
This technique generates high-precision pseudo-page-to-page labels for diacritic languages.
The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences.
arXiv Detail & Related papers (2024-10-17T08:05:02Z)
- A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
arXiv Detail & Related papers (2024-09-06T09:26:45Z)
- CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models [0.0]
This paper introduces Context Leveraging OCR Correction (CLOCR-C).
It uses the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality.
The study aims to determine whether LMs can perform post-OCR correction, whether they improve downstream NLP tasks, and how much value providing socio-cultural context adds to the correction process.
arXiv Detail & Related papers (2024-08-30T17:26:05Z)
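As a rough illustration of supplying socio-cultural context during correction, the sketch below assembles a correction prompt from document metadata. The prompt wording, the metadata fields, and the `lm` callable are hypothetical placeholders, not the CLOCR-C implementation.

```python
# Illustrative only: context-leveraging prompt assembly for post-OCR correction.
from typing import Callable

def build_correction_prompt(ocr_text: str, title: str, place: str, year: int) -> str:
    # Hypothetical socio-cultural context fields: publication title, place, year.
    context = (f"The passage below was produced by OCR from '{title}', "
               f"printed in {place} around {year}.")
    instructions = ("Correct OCR errors only. Keep the period spelling, wording, "
                    "and line breaks; return just the corrected passage.")
    return f"{context}\n{instructions}\n\n{ocr_text}"

def correct_page(ocr_text: str, title: str, place: str, year: int,
                 lm: Callable[[str], str]) -> str:
    """`lm` is any prompt-in/text-out language-model interface."""
    return lm(build_correction_prompt(ocr_text, title, place, year))
```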
- Data Generation for Post-OCR correction of Cyrillic handwriting [41.94295877935867]
This paper focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves.
Such an engine generates highly realistic handwritten text in any amount, which we use to create a substantial dataset.
We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our post-OCR correction (POC) model training.
arXiv Detail & Related papers (2023-11-27T15:01:26Z)
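The core of such a generation engine is sampling pen strokes from Bézier curves. The minimal sketch below evaluates a cubic Bézier curve and jitters its control points to vary the strokes; the jitter scheme and parameters are illustrative assumptions, not the paper's engine.

```python
# Minimal sketch: pen strokes from cubic Bézier curves with control-point jitter.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n_points=64):
    """Evaluate a cubic Bézier curve B(t) for t in [0, 1]."""
    t = np.linspace(0.0, 1.0, n_points)[:, None]           # (n, 1)
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)      # (n, 2) stroke points

def jittered_stroke(p0, p1, p2, p3, scale=0.5, rng=None):
    """Perturb control points so repeated samples of the same glyph vary naturally."""
    rng = rng or np.random.default_rng()
    pts = [np.asarray(p, dtype=float) + rng.normal(0.0, scale, size=2)
           for p in (p0, p1, p2, p3)]
    return cubic_bezier(*pts)

# Example: one stroke of a synthetic glyph.
stroke = jittered_stroke((0, 0), (2, 5), (6, 5), (8, 0))
```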
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Optimizing the Neural Network Training for OCR Error Correction of Historical Hebrew Texts [0.934612743192798]
This paper proposes an innovative method for training a lightweight neural network for Hebrew OCR post-correction using significantly less manually created data.
Historical OCRed newspapers were analyzed to learn common language- and corpus-specific OCR errors.
arXiv Detail & Related papers (2023-07-30T12:59:06Z)
- Toward a Period-Specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts [0.934612743192798]
OCR technology is error-prone, especially when an OCRed document was written hundreds of years ago.
Neural networks have shown great success in solving various text processing tasks, including OCR post-correction.
The main disadvantage of using neural networks for historical corpora is the lack of sufficiently large training datasets.
arXiv Detail & Related papers (2023-07-30T12:40:31Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition [90.34177266618143]
We propose FastCorrect, a novel NAR error correction model based on edit alignment.
FastCorrect speeds up inference by 6-9 times while maintaining accuracy (8-14% WER reduction) compared with an autoregressive correction model.
It outperforms the accuracy of popular NAR models adopted in neural machine translation by a large margin.
arXiv Detail & Related papers (2021-05-09T05:35:36Z)
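Edit alignment, as used by non-autoregressive correction models of this kind, can be illustrated with a plain Levenshtein alignment that assigns each recognized token the number of reference tokens it should emit. The convention of attaching insertions to the preceding token is an assumption for illustration, not FastCorrect's exact algorithm.

```python
# Sketch: Levenshtein-based edit alignment producing per-token "durations"
# (0 = delete, 1 = keep/substitute, >1 = insertions attached to that token).
def edit_alignment(src, tgt):
    n, m = len(src), len(tgt)
    # Standard edit-distance DP table.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete src[i-1]
                           dp[i][j - 1] + 1,        # insert tgt[j-1]
                           dp[i - 1][j - 1] + cost) # keep / substitute
    # Backtrace, counting how many target tokens align to each source token.
    durations = [0] * n
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (0 if src[i - 1] == tgt[j - 1] else 1)):
            durations[i - 1] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:  # deletion: emits nothing
            i -= 1
        else:                                         # insertion: attach to previous token
            if i > 0:
                durations[i - 1] += 1
            j -= 1
    return durations

print(edit_alignment("abxde", "abcde"))   # [1, 1, 1, 1, 1]  (substitution)
print(edit_alignment("abde", "abcde"))    # [1, 2, 1, 1]     (insertion after 'b')
```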