Toward a Period-Specific Optimized Neural Network for OCR Error
Correction of Historical Hebrew Texts
- URL: http://arxiv.org/abs/2307.16213v1
- Date: Sun, 30 Jul 2023 12:40:31 GMT
- Title: Toward a Period-Specific Optimized Neural Network for OCR Error
Correction of Historical Hebrew Texts
- Authors: Omri Suissa, Maayan Zhitomirsky-Geffet, Avshalom Elmalech
- Abstract summary: OCR technology is error-prone, especially when an OCRed document was written hundreds of years ago.
Neural networks have shown great success in solving various text processing tasks, including OCR post-correction.
The main disadvantage of using neural networks for historical corpora is the lack of sufficiently large training datasets.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Over the past few decades, large archives of paper-based historical
documents, such as books and newspapers, have been digitized using Optical
Character Recognition (OCR) technology. Unfortunately, this widely used
technology is error-prone, especially when the OCRed document was written
hundreds of years ago. Neural networks have shown great success in solving
various text processing tasks, including OCR post-correction. The main
disadvantage of using neural networks for historical corpora is the lack of the
sufficiently large training datasets they require, especially for
morphologically rich languages like Hebrew. Moreover, because of Hebrew's
unique features, it is unclear what the optimal network structure and
hyperparameter (predefined parameter) values are for OCR error correction in
Hebrew. Furthermore, languages change across genres and periods, and these
changes may affect the accuracy of OCR post-correction neural network models.
To overcome these challenges, we developed a new multi-phase method for
generating artificial training datasets with OCR errors and for optimizing
hyperparameters, in order to build an effective neural network for OCR
post-correction in Hebrew.
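The data-generation phase described in the abstract, injecting OCR-like errors into clean text to produce artificial training pairs, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual method: the confusion table, probabilities, and function name here are invented for the example, whereas the paper learns its error model from analysis of real historical OCRed corpora.

```python
import random

# Hypothetical confusion sets of visually similar Hebrew letters; the paper's
# actual error model is derived from real OCR output, not hand-written lists.
CONFUSIONS = {
    "ה": ["ח", "ת"],
    "ו": ["ן", "י"],
    "ר": ["ד"],
    "ב": ["כ"],
    "ם": ["ס"],
}

def inject_ocr_errors(text, p_sub=0.05, p_del=0.01, p_ins=0.01, rng=None):
    """Return a noisy copy of `text` with OCR-style substitutions,
    deletions, and spurious insertions; (noisy, clean) pairs can then
    serve as training data for a post-correction network."""
    rng = rng or random.Random(0)
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < p_sub:
            out.append(rng.choice(CONFUSIONS[ch]))  # visual confusion
        elif rng.random() < p_del:
            continue  # character dropped by the "scanner"
        else:
            out.append(ch)
        if rng.random() < p_ins:
            out.append(rng.choice("ויןר"))  # spurious narrow mark
    return "".join(out)

clean = "הספר הישן נמצא בארכיון"
noisy = inject_ocr_errors(clean)
pairs = [(noisy, clean)]  # (input, target) for supervised post-correction
```

In the paper's multi-phase pipeline, pairs like these would feed the network's training stage; the probabilities and confusion sets above are placeholders for values a real error analysis would supply.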
Related papers
- Data Generation for Post-OCR correction of Cyrillic handwriting
This paper focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves.
Such an engine generates highly realistic handwritten text in any amount, which we utilize to create a substantial dataset.
We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our post-OCR correction (POC) model training.
arXiv Detail & Related papers (2023-11-27T15:01:26Z)
- Generative error correction for code-switching speech recognition using large language models
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence.
We propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR system to address the CS problem.
arXiv Detail & Related papers (2023-10-17T14:49:48Z)
- Optimizing the Neural Network Training for OCR Error Correction of Historical Hebrew Texts
This paper proposes an innovative method for training a lightweight neural network for Hebrew OCR post-correction using significantly less manually created data.
Historical OCRed newspapers were analyzed to learn common language- and corpus-specific OCR errors.
arXiv Detail & Related papers (2023-07-30T12:59:06Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala
We show that utilizing OCR reduces the time spent on manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- LongFNT: Long-form Speech Recognition with Factorized Neural Transducer
We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor.
The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reductions, respectively.
arXiv Detail & Related papers (2022-11-17T08:48:27Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- Neural OCR Post-Hoc Correction of Historical Corpora
We propose a neural approach based on a combination of recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors.
We show that our models are robust in capturing diverse OCR transcription errors and reduce a word error rate of 32.3% by more than 89%.
arXiv Detail & Related papers (2021-02-01T01:35:55Z)
- On the Accuracy of CRNNs for Line-Based OCR: A Multi-Parameter Evaluation
We train a high-quality optical character recognition (OCR) model for difficult historical typefaces on degraded paper.
We are able to obtain a 0.44% character error rate (CER) model from only 10,000 lines of training data.
We show ablations for all components of our training pipeline, which relies on the open source framework Calamari.
arXiv Detail & Related papers (2020-08-06T17:20:56Z)
- Recognizing Long Grammatical Sequences Using Recurrent Networks Augmented With An External Differentiable Stack
Recurrent neural networks (RNNs) are a widely used deep architecture for sequence modeling, generation, and prediction.
RNNs generalize poorly over very long sequences, which limits their applicability to many important temporal processing and time series forecasting problems.
One way to address these shortcomings is to couple an RNN with an external, differentiable memory structure, such as a stack.
In this paper, we improve the memory-augmented RNN with important architectural and state updating mechanisms.
arXiv Detail & Related papers (2020-04-04T14:19:15Z)