Optimizing the Neural Network Training for OCR Error Correction of
Historical Hebrew Texts
- URL: http://arxiv.org/abs/2307.16220v1
- Date: Sun, 30 Jul 2023 12:59:06 GMT
- Title: Optimizing the Neural Network Training for OCR Error Correction of
Historical Hebrew Texts
- Authors: Omri Suissa, Avshalom Elmalech, Maayan Zhitomirsky-Geffet
- Abstract summary: This paper proposes an innovative method for training a lightweight neural network for Hebrew OCR post-correction using significantly less manually created data.
Historical OCRed newspapers were analyzed to learn common language- and corpus-specific OCR errors.
- Score: 0.934612743192798
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Over the past few decades, large archives of paper-based documents such as
books and newspapers have been digitized using Optical Character Recognition.
This technology is error-prone, especially for historical documents. To correct
OCR errors, post-processing algorithms have been proposed based on natural
language analysis and machine learning techniques such as neural networks.
A major disadvantage of neural networks is the vast amount of manually
labeled data they require for training, which is often unavailable. This paper
proposes an innovative method for training a lightweight neural network for
Hebrew OCR post-correction using significantly less manually created data. The main
research goal is to develop a method for automatically generating language and
task-specific training data to improve the neural network results for OCR
post-correction, and to investigate which type of dataset is the most effective
for OCR post-correction of historical documents. To this end, a series of
experiments using several datasets was conducted. The evaluation corpus was
based on Hebrew newspapers from the JPress project. Historical OCRed
newspapers were analyzed to learn common language- and corpus-specific OCR
errors. We found that training the network with the proposed method is more
effective than using randomly generated errors. The results also show that the
performance of the neural network for OCR post-correction strongly depends on
the genre and area of the training data. Moreover, neural networks that were
trained with the proposed method outperform other state-of-the-art neural
networks for OCR post-correction and complex spellcheckers. These results may
have practical implications for many digital humanities projects.
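As a rough illustration of the data-generation idea, here is a minimal sketch assuming a character-level confusion table mined from a small aligned OCR/ground-truth sample; corpus-style errors are injected into clean text to produce (noisy, clean) training pairs. The confusion table, error rate, and sample text are invented for illustration and are not taken from the paper.

```python
# A minimal sketch of corpus-specific training-data generation for OCR
# post-correction. The confusion table, error rate, and sample text below
# are invented for illustration; they are not taken from the paper.
import random

CONFUSIONS = {          # hypothetical visually-similar Hebrew letter pairs
    "ה": ["ח", "ת"],
    "ו": ["י", "ן"],
    "ר": ["ד"],
}
ERROR_RATE = 0.05       # assumed per-character corruption probability

def corrupt(text: str, rng: random.Random) -> str:
    """Inject corpus-style OCR errors into clean text."""
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < ERROR_RATE:
            out.append(rng.choice(CONFUSIONS[ch]))
        else:
            out.append(ch)
    return "".join(out)

# Each (noisy, clean) pair is a training example for the post-correction net.
rng = random.Random(42)
clean_lines = ["טקסט עברי נקי מן העיתון", "שורה נוספת של טקסט"]
pairs = [(corrupt(line, rng), line) for line in clean_lines]
```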
Related papers
- Data Generation for Post-OCR correction of Cyrillic handwriting [41.94295877935867]
This paper focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves (a minimal Bézier-curve sketch appears after this list).
Such an engine generates highly realistic handwritten text in any amount, which we utilize to create a substantial dataset.
We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our post-OCR correction (POC) model training.
arXiv Detail & Related papers (2023-11-27T15:01:26Z)
- Toward a Period-Specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts [0.934612743192798]
OCR technology is error-prone, especially when an OCRed document was written hundreds of years ago.
Neural networks have shown great success in solving various text processing tasks, including OCR post-correction.
The main disadvantage of using neural networks for historical corpora is the lack of sufficiently large training datasets.
arXiv Detail & Related papers (2023-07-30T12:40:31Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Learning to Learn with Generative Models of Neural Network Checkpoints [71.06722933442956]
We construct a dataset of neural network checkpoints and train a generative model on the parameters.
We find that our approach successfully generates parameters for a wide range of loss prompts.
We apply our method to different neural network architectures and tasks in supervised and reinforcement learning.
arXiv Detail & Related papers (2022-09-26T17:59:58Z)
- Reconstructing Training Data from Trained Neural Networks [42.60217236418818]
We show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier.
We propose a novel reconstruction scheme that stems from recent theoretical results about the implicit bias in training neural networks with gradient-based methods.
arXiv Detail & Related papers (2022-06-15T18:35:16Z)
- A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond [145.43029264191543]
Non-autoregressive (NAR) generation was first proposed in neural machine translation (NMT) to speed up inference.
While NAR generation can significantly accelerate machine translation inference, the speedup comes at the cost of translation accuracy relative to autoregressive (AR) generation (a toy AR-versus-NAR decoding sketch appears after this list).
Many new models and algorithms have been designed/proposed to bridge the accuracy gap between NAR generation and AR generation.
arXiv Detail & Related papers (2022-04-20T07:25:22Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Local Critic Training for Model-Parallel Learning of Deep Neural Networks [94.69202357137452]
We propose a novel model-parallel learning method, called local critic training.
We show that the proposed approach successfully decouples the update process of the layer groups for both convolutional neural networks (CNNs) and recurrent neural networks (RNNs) (a simplified toy of this decoupling appears after this list).
We also show that networks trained by the proposed method can be used for structural optimization.
arXiv Detail & Related papers (2021-02-03T09:30:45Z)
- On the Accuracy of CRNNs for Line-Based OCR: A Multi-Parameter Evaluation [0.0]
We train a high-quality optical character recognition (OCR) model for difficult historical typefaces on degraded paper.
We are able to obtain a 0.44% character error rate (CER) model from only 10,000 lines of training data (a minimal CER computation appears after this list).
We show ablations for all components of our training pipeline, which relies on the open source framework Calamari.
arXiv Detail & Related papers (2020-08-06T17:20:56Z)
- Multi-fidelity Neural Architecture Search with Knowledge Distillation [69.09782590880367]
We propose a Bayesian multi-fidelity method for neural architecture search: MF-KD.
Knowledge distillation adds to the loss function a term that forces the network to mimic a teacher network (a minimal distillation-loss sketch appears after this list).
We show that training for a few epochs with such a modified loss function leads to a better selection of neural architectures than training for a few epochs with a logistic loss.
arXiv Detail & Related papers (2020-06-15T12:32:38Z)
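The Cyrillic data-generation entry above builds its synthetic strokes from Bézier curves. The sketch below samples one cubic Bézier curve in the standard Bernstein form; the control points are arbitrary and only suggest a pen stroke, whereas a real engine would chain and perturb many such curves.

```python
# A minimal sketch, assuming nothing beyond the standard cubic Bézier
# (Bernstein) form; the control points are arbitrary and only suggest a
# pen-stroke arc.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample n points along the cubic Bézier defined by four control points."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

stroke = cubic_bezier(np.array([0.0, 0.0]), np.array([1.0, 2.0]),
                      np.array([3.0, 2.0]), np.array([4.0, 0.0]))
print(stroke.shape)  # (50, 2): x/y points tracing the stroke
```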
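For the non-autoregressive survey entry, this toy contrasts the two decoding regimes: AR emits one token per step conditioned on the prefix, while NAR fills every position in a single parallel pass. The scoring functions are random stubs standing in for a trained translation model; only the control flow is the point.

```python
# Toy AR vs. NAR decoding; the "model" is a random-logits stub.
import numpy as np

VOCAB, LENGTH = 50, 8
rng = np.random.default_rng(0)

def ar_decode(step_logits_fn, length):
    """AR: one token per step; each step conditions on the growing prefix."""
    prefix = []
    for _ in range(length):                     # `length` sequential steps
        logits = step_logits_fn(prefix)
        prefix.append(int(np.argmax(logits)))
    return prefix

def nar_decode(all_logits_fn, length):
    """NAR: predict every position in a single parallel pass."""
    logits = all_logits_fn(length)              # shape (length, VOCAB)
    return np.argmax(logits, axis=-1).tolist()

ar_out = ar_decode(lambda prefix: rng.normal(size=VOCAB), LENGTH)
nar_out = nar_decode(lambda n: rng.normal(size=(n, VOCAB)), LENGTH)
```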
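For the local critic training entry, here is a drastically simplified toy of the decoupling idea: a layer group updates against a small local critic instead of waiting for the true gradient from the layers above. The two-group architecture and the critic's training signal are invented simplifications, not the paper's exact scheme.

```python
# Toy of decoupled layer-group updates via a local critic (simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F

g1 = nn.Sequential(nn.Linear(20, 32), nn.ReLU())  # layer group 1
g2 = nn.Linear(32, 10)                            # layer group 2 (output)
critic = nn.Linear(32, 10)                        # local critic for group 1

opt1 = torch.optim.SGD(list(g1.parameters()) + list(critic.parameters()), lr=0.1)
opt2 = torch.optim.SGD(g2.parameters(), lr=0.1)

x, y = torch.randn(16, 20), torch.randint(0, 10, (16,))

h1 = g1(x)
loss1 = F.cross_entropy(critic(h1), y)  # group 1 trains on the critic's loss
opt1.zero_grad()
loss1.backward()
opt1.step()

# detach() blocks the gradient from flowing back into group 1, so the two
# groups' updates are decoupled and could run on different devices.
loss2 = F.cross_entropy(g2(h1.detach()), y)
opt2.zero_grad()
loss2.backward()
opt2.step()
```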
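The CRNN entry reports a 0.44% character error rate (CER). CER is conventionally computed as the Levenshtein edit distance between hypothesis and reference divided by the reference length, as in this sketch.

```python
# CER: Levenshtein edit distance between the OCR hypothesis and the
# reference, divided by the reference length.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

print(f"{cer('he11o world', 'hello world'):.2%}")  # two substitutions -> 18.18%
```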
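Finally, the MF-KD entry describes adding a distillation term to the loss. A common (Hinton-style) formulation, sketched below with illustrative hyperparameters, mixes the usual cross-entropy with a temperature-softened KL term toward the teacher; the paper's exact variant may differ.

```python
# Hinton-style distillation loss: hard cross-entropy plus a
# temperature-softened KL term toward the teacher's outputs.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale soft-target gradient
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student, teacher = torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = kd_loss(student, teacher, labels)
```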