Toward a Period-Specific Optimized Neural Network for OCR Error
Correction of Historical Hebrew Texts
- URL: http://arxiv.org/abs/2307.16213v1
- Date: Sun, 30 Jul 2023 12:40:31 GMT
- Title: Toward a Period-Specific Optimized Neural Network for OCR Error
Correction of Historical Hebrew Texts
- Authors: Omri Suissa, Maayan Zhitomirsky-Geffet, Avshalom Elmalech
- Abstract summary: OCR technology is error-prone, especially when an OCRed document was written hundreds of years ago.
Neural networks have shown great success in solving various text processing tasks, including OCR post-correction.
The main disadvantage of using neural networks for historical corpora is the lack of sufficiently large training datasets.
- Score: 0.934612743192798
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Over the past few decades, large archives of paper-based historical
documents, such as books and newspapers, have been digitized using the Optical
Character Recognition (OCR) technology. Unfortunately, this broadly used
technology is error-prone, especially when an OCRed document was written
hundreds of years ago. Neural networks have shown great success in solving
various text processing tasks, including OCR post-correction. The main
disadvantage of using neural networks for historical corpora is the lack of
sufficiently large training datasets they require to learn from, especially for
morphologically-rich languages like Hebrew. Moreover, due to Hebrew's unique
features, it is not clear what the optimal structure and hyperparameter
(predefined parameter) values of a neural network for OCR error correction are.
Furthermore, languages change across genres and periods. These changes may
affect the accuracy of OCR post-correction neural network models. To overcome
these challenges, we developed a new multi-phase method for generating
artificial training datasets with OCR errors and hyperparameter optimization
for building an effective neural network for OCR post-correction in Hebrew.
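As a rough illustration of the dataset-generation phase described in the abstract, the sketch below corrupts clean Hebrew text with character-level substitutions, deletions, and insertions at a target error rate. The confusion pairs and rates are illustrative assumptions, not the paper's learned error statistics.

```python
import random

# Illustrative visually-confusable Hebrew glyph pairs (an assumption;
# the paper derives real confusion statistics from OCRed corpora).
CONFUSIONS = {
    "ח": ["ה", "ת"],
    "ך": ["ד", "ר"],
    "ו": ["ז", "ן"],
    "ב": ["כ"],
}

def inject_ocr_errors(text: str, error_rate: float = 0.1, seed: int = 0) -> str:
    """Corrupt clean text with OCR-like substitution/deletion/insertion errors."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() >= error_rate:
            out.append(ch)
            continue
        op = rng.choice(["sub", "del", "ins"])
        if op == "sub" and ch in CONFUSIONS:
            out.append(rng.choice(CONFUSIONS[ch]))    # confusable swap
        elif op == "ins":
            out.append(ch)
            out.append(rng.choice(list(CONFUSIONS)))  # spurious extra glyph
        elif op == "del":
            pass                                      # glyph dropped by OCR
        else:
            out.append(ch)                            # no known confusion: keep
    return "".join(out)

clean = "בראשית ברא אלהים"
print((clean, inject_ocr_errors(clean, error_rate=0.2)))  # (clean, noisy) pair
```

Each (clean, noisy) pair produced this way can serve as one artificial training example for the post-correction network.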
Related papers
- Reference-Based Post-OCR Processing with LLM for Diacritic Languages [0.0]
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text.
This technique generates high-precision pseudo-page-to-page labels for diacritic languages.
The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences (a toy alignment sketch follows this entry).
arXiv Detail & Related papers (2024-10-17T08:05:02Z)
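A minimal sketch of the reference-alignment idea in the entry above, using Python's standard difflib to pair noisy OCR lines with a trusted ebook text and emit (noisy, corrected) pseudo-labels; the fuzzy-match cutoff is an assumption, not the paper's procedure.

```python
import difflib

def pseudo_label(ocr_lines, reference_lines, cutoff=0.7):
    """Pair OCR output with its best match in a trusted reference text,
    yielding (noisy, corrected) supervision pairs."""
    pairs = []
    for noisy in ocr_lines:
        # Best fuzzy match in the reference; lines below the cutoff are skipped.
        match = difflib.get_close_matches(noisy, reference_lines, n=1, cutoff=cutoff)
        if match:
            pairs.append((noisy, match[0]))
    return pairs

ocr = ["Tho quick brown f0x", "jumps ovor the lazy dog"]
ref = ["The quick brown fox", "jumps over the lazy dog"]
print(pseudo_label(ocr, ref))  # [('Tho quick brown f0x', 'The quick brown fox'), ...]
```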
- Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation [73.9145653659403]
We show that Generative Error Correction (GEC) models struggle to generalize beyond the specific types of errors encountered during training.
We propose DARAG, a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios.
Our approach is simple, scalable, and both domain- and language-agnostic.
arXiv Detail & Related papers (2024-10-17T04:00:29Z)
- CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models [0.0]
This paper introduces Context Leveraging OCR Correction (CLOCR-C).
It uses the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality.
The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing socio-cultural context as part of the correction process (a prompt sketch follows this entry).
arXiv Detail & Related papers (2024-08-30T17:26:05Z)
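One hedged reading of how socio-cultural context could enter the correction step is to fold it into the LM prompt. The template and field wording below are assumptions for illustration, not CLOCR-C's actual prompts.

```python
def build_correction_prompt(ocr_text: str, context: str) -> str:
    """Assemble a post-OCR correction prompt that supplies socio-cultural
    context (period, genre, place of publication) alongside the noisy text."""
    return (
        "You are correcting OCR output from a historical document.\n"
        f"Context: {context}\n"
        "Fix recognition errors only; do not modernize spelling.\n"
        f"OCR text:\n{ocr_text}\n"
        "Corrected text:"
    )

prompt = build_correction_prompt(
    ocr_text="LONDON, Mondav, Jannary 3. Tne Times reports...",
    context="British daily newspaper, 1870s",
)
print(prompt)  # send to any instruction-tuned LM
```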
- Data Generation for Post-OCR correction of Cyrillic handwriting [41.94295877935867]
This paper focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves.
Such an engine generates highly realistic handwritten text in any quantity, which we utilize to create a substantial dataset.
We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for training our post-OCR correction (POC) model.
arXiv Detail & Related papers (2023-11-27T15:01:26Z)
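The handwriting engine in the entry above builds on Bézier curves. Below is a minimal numpy sketch of a cubic Bézier stroke; the random control-point jitter is an illustrative assumption.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Evaluate a cubic Bezier curve at n points, the primitive from which
    smooth handwriting-like strokes can be composed."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

rng = np.random.default_rng(0)
p0, p3 = np.array([0.0, 0.0]), np.array([1.0, 0.0])  # stroke endpoints
p1 = rng.normal([0.3, 0.2], 0.1)                     # jittered control points
p2 = rng.normal([0.7, -0.2], 0.1)
stroke = cubic_bezier(p0, p1, p2, p3)                # (50, 2) array of x, y
print(stroke[:3])
```

Varying the control points per character, then rendering and recognizing the strokes, yields (image, text) and later (OCR output, text) pairs at scale.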
- Generative error correction for code-switching speech recognition using large language models [49.06203730433107]
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence.
We propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR system to address the CS problem.
arXiv Detail & Related papers (2023-10-17T14:49:48Z)
- Optimizing the Neural Network Training for OCR Error Correction of Historical Hebrew Texts [0.934612743192798]
This paper proposes an innovative method for training a light-weight neural network for Hebrew OCR post-correction using significantly less manually created data.
Historical OCRed newspapers were analyzed to identify common language- and corpus-specific OCR errors (a confusion-counting sketch follows this entry).
arXiv Detail & Related papers (2023-07-30T12:59:06Z)
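The error-analysis step above might be approximated as follows: align OCR lines with their manual corrections and count character-level confusions. The difflib alignment is an assumed stand-in for the paper's actual analysis.

```python
import difflib
from collections import Counter

def confusion_counts(ocr: str, gold: str) -> Counter:
    """Count character-level (ocr_char -> gold_char) confusions between an
    OCR line and its manually corrected version."""
    counts = Counter()
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, ocr, gold).get_opcodes():
        if op == "replace" and (i2 - i1) == (j2 - j1):  # 1:1 substitutions only
            for a, b in zip(ocr[i1:i2], gold[j1:j2]):
                counts[(a, b)] += 1
    return counts

print(confusion_counts("חיה המלך", "היה המלך"))  # Counter({('ח', 'ה'): 1})
```

Aggregated over a corpus, such counts give the corpus-specific error distribution from which realistic artificial training data can be generated.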
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent on manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance (a pseudo-labeling sketch follows this entry).
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
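A rough sketch of the semi-supervised idea above: pseudo-label unannotated OCR lines with the current model and keep only outputs whose words are mostly attested in a lexicon before retraining. The `correct` callable and acceptance threshold are placeholders, not the paper's method.

```python
def lexicon_filtered_pseudo_labels(unlabeled_ocr, correct, lexicon, min_in_vocab=0.9):
    """One self-training round: keep (noisy, model-corrected) pairs only when
    most corrected words appear in the lexicon."""
    kept = []
    for noisy in unlabeled_ocr:
        hyp = correct(noisy)  # placeholder for the current post-correction model
        words = hyp.split()
        if words and sum(w in lexicon for w in words) / len(words) >= min_in_vocab:
            kept.append((noisy, hyp))
    return kept  # add to the training set, retrain, and repeat

demo = lexicon_filtered_pseudo_labels(
    ["tne quick brown f0x"],
    correct=lambda s: "the quick brown fox",  # stand-in model
    lexicon={"the", "quick", "brown", "fox"},
)
print(demo)  # [('tne quick brown f0x', 'the quick brown fox')]
```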
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, the factorized neural Transducer, which factorizes the blank and vocabulary prediction (sketched after this entry).
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
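The factorization above separates the blank decision from vocabulary prediction, so the vocabulary branch can be adapted on text alone like a standalone LM. A minimal PyTorch-style sketch; the layer sizes and head shapes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FactorizedPredictor(nn.Module):
    """Sketch of factorized prediction: one head scores blank, a separate
    LM-like branch scores vocabulary tokens and can be fine-tuned on
    out-of-domain text without touching the blank head."""
    def __init__(self, hidden: int = 256, vocab: int = 1000):
        super().__init__()
        self.blank_head = nn.Linear(hidden, 1)  # blank vs. non-blank score
        self.vocab_lm = nn.Sequential(          # standalone-LM-style branch
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, vocab)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.blank_head(h), self.vocab_lm(h)], dim=-1)

model = FactorizedPredictor()
print(model(torch.randn(2, 10, 256)).shape)  # torch.Size([2, 10, 1001])
```

Adapting only `vocab_lm` on target-domain text is what transfers the standalone LM improvement to the transducer.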
- Neural OCR Post-Hoc Correction of Historical Corpora [4.427447378048202]
We propose a neural approach based on a combination of recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors.
We show that our models are robust in capturing diverse OCR transcription errors and reduce an initial word error rate of 32.3% by more than 89%.
arXiv Detail & Related papers (2021-02-01T01:35:55Z)
- Recognizing Long Grammatical Sequences Using Recurrent Networks Augmented With An External Differentiable Stack [73.48927855855219]
Recurrent neural networks (RNNs) are a widely used deep architecture for sequence modeling, generation, and prediction.
RNNs generalize poorly over very long sequences, which limits their applicability to many important temporal processing and time series forecasting problems.
One way to address these shortcomings is to couple an RNN with an external, differentiable memory structure, such as a stack.
In this paper, we improve the memory-augmented RNN with important architectural and state-updating mechanisms (a toy soft-stack update follows this entry).
arXiv Detail & Related papers (2020-04-04T14:19:15Z)
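The external differentiable stack above can be sketched as a soft memory in which PUSH, POP, and NO-OP results are blended by learned probabilities, so gradients flow through the memory. This toy numpy version fixes the stack depth and is an illustration, not the paper's exact mechanism.

```python
import numpy as np

def soft_stack_update(stack, value, action_probs):
    """One differentiable stack step.
    stack: (depth, dim); value: (dim,) vector to push;
    action_probs: (p_push, p_pop, p_noop) from a controller RNN."""
    p_push, p_pop, p_noop = action_probs
    pushed = np.vstack([value, stack[:-1]])                    # write on top
    popped = np.vstack([stack[1:], np.zeros_like(stack[:1])])  # shift up
    return p_push * pushed + p_pop * popped + p_noop * stack

stack = np.zeros((4, 3))
stack = soft_stack_update(stack, np.ones(3), (0.9, 0.05, 0.05))
print(stack[0])  # top is mostly the pushed value: [0.9 0.9 0.9]
```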