Data Generation for Post-OCR correction of Cyrillic handwriting
- URL: http://arxiv.org/abs/2311.15896v1
- Date: Mon, 27 Nov 2023 15:01:26 GMT
- Title: Data Generation for Post-OCR correction of Cyrillic handwriting
- Authors: Evgenii Davydkin, Aleksandr Markelov, Egor Iuldashev, Anton Dudkin,
Ivan Krivorotov
- Abstract summary: This paper focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves.
Such an engine generates highly realistic handwritten text in any amount, which we utilize to create a substantial dataset.
We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training.
- Score: 41.94295877935867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel approach to post-Optical Character Recognition
Correction (POC) for handwritten Cyrillic text, addressing a significant gap in
current research methodologies. This gap is due to the lack of large text
corpora that provide OCR errors for further training of language-based POC
models, which are demanding in terms of corpus size. Our study primarily
focuses on the development and application of a synthetic handwriting
generation engine based on Bézier curves. Such an engine generates highly
realistic handwritten text in any amount, which we utilize to create a
substantial dataset by transforming Russian text corpora sourced from the
internet. We apply a Handwritten Text Recognition (HTR) model to this dataset
to identify OCR errors, forming the basis for our POC model training. The
correction model is trained on a 90-symbol input context, utilizing a
pre-trained T5 architecture with a seq2seq correction task. We evaluate our
approach on HWR200 and School_notebooks_RU datasets as they provide significant
challenges in the HTR domain. Furthermore, POC can be used to highlight errors
for teachers, evaluating student performance. This can be done simply by
comparing sentences before and after correction, displaying differences in
text. Our primary contribution lies in the innovative use of Bézier curves
for Cyrillic text generation and subsequent error correction using a
specialized POC model. We validate our approach by presenting Word Accuracy
Rate (WAR) and Character Accuracy Rate (CAR) results, both with and without
post-OCR correction, using real open corpora of handwritten Cyrillic text.
These results, coupled with our methodology, are designed to be reproducible,
paving the way for further advancements in the field of OCR and handwritten
text analysis. Paper contributions can be found at
https://github.com/dbrainio/CyrillicHandwritingPOC
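The abstract reports Word Accuracy Rate (WAR) and Character Accuracy Rate (CAR) and suggests highlighting errors by comparing sentences before and after correction. The following is a minimal sketch of both ideas, not the authors' implementation. It assumes the common definition of an accuracy rate as one minus the normalized Levenshtein distance (the abstract does not state the exact formula), and the function names `edit_distance`, `accuracy_rate`, `car`, `war`, and `changed_words` are hypothetical.

```python
from difflib import SequenceMatcher


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete an element of ref
                curr[j - 1] + 1,         # insert an element of hyp
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def accuracy_rate(ref, hyp):
    """ASSUMED definition: 1 - distance / len(ref), floored at 0."""
    if len(ref) == 0:
        return 1.0 if len(hyp) == 0 else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def car(reference, hypothesis):
    """Character Accuracy Rate: compare the strings character by character."""
    return accuracy_rate(reference, hypothesis)


def war(reference, hypothesis):
    """Word Accuracy Rate: compare whitespace-separated word sequences."""
    return accuracy_rate(reference.split(), hypothesis.split())


def changed_words(before, after):
    """List (old_words, new_words) pairs where correction altered the text."""
    a, b = before.split(), after.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Calling `car(truth, htr_output)` and `car(truth, corrected_output)` on the same ground truth gives the before and after comparison described in the abstract, and `changed_words(htr_output, corrected_output)` returns the word spans a teacher-facing tool could highlight.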
Related papers
- Reference-Based Post-OCR Processing with LLM for Diacritic Languages [0.0]
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text.
This technique generates high-precision pseudo-page-to-page labels for diacritic languages.
The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences.
arXiv Detail & Related papers (2024-10-17T08:05:02Z) - CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models [0.0]
This paper introduces Context Leveraging OCR Correction (CLOCR-C).
It uses the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality.
The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing socio-cultural context as part of the correction process.
arXiv Detail & Related papers (2024-08-30T17:26:05Z) - Classification of Non-native Handwritten Characters Using Convolutional Neural Network [0.0]
The classification of English characters written by non-native users is performed by proposing a custom-tailored CNN model.
We train this CNN with a new dataset called the handwritten isolated English character dataset.
The proposed model with five convolutional layers and one hidden layer outperforms state-of-the-art models in terms of character recognition accuracy.
arXiv Detail & Related papers (2024-06-06T21:08:07Z) - Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z) - Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition methods have struggled to attain high accuracy and fast inference speed.
We present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
arXiv Detail & Related papers (2023-07-23T09:04:13Z) - Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts [12.346821696831805]
We present a neural spelling correction model built on Google OCR-ed Tibetan Manuscripts to auto-correct OCR-ed noisy output.
This paper is divided into four sections: dataset, model architecture, training and analysis.
arXiv Detail & Related papers (2023-04-07T00:45:12Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models [47.48019831416665]
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
arXiv Detail & Related papers (2021-09-21T16:01:56Z) - Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents [2.6201102730518606]
We demonstrate an effective framework for mitigating OCR errors for any downstream NLP task.
We first address the data scarcity problem for model training by constructing a document synthesis pipeline.
For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
arXiv Detail & Related papers (2021-08-06T00:32:54Z) - Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms the SoTA models on the TextVQA dataset and two tasks of the ST-VQA dataset among all models except the pre-training-based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.