A Novel Pipeline for Improving Optical Character Recognition through
Post-processing Using Natural Language Processing
- URL: http://arxiv.org/abs/2307.04245v1
- Date: Sun, 9 Jul 2023 18:51:17 GMT
- Title: A Novel Pipeline for Improving Optical Character Recognition through
Post-processing Using Natural Language Processing
- Authors: Aishik Rakshit, Samyak Mehta, Anirban Dasgupta
- Abstract summary: We propose a post-processing approach using Natural Language Processing (NLP) tools.
This work presents an end-to-end pipeline that first performs OCR on the handwritten or printed text and then improves its accuracy using NLP.
- Score: 2.9499386124223257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optical Character Recognition (OCR) technology finds applications in
digitizing books and unstructured documents, along with applications in other
domains such as mobility statistics, law enforcement, traffic, security
systems, etc. State-of-the-art methods work well for OCR of printed
text on license plates, shop names, etc. However, applications such as printed
textbooks and handwritten texts have limited accuracy with existing techniques.
The reason may be attributed to similar-looking characters and variations in
handwritten characters. Since these issues are challenging to address with OCR
technologies exclusively, we propose a post-processing approach using Natural
Language Processing (NLP) tools. This work presents an end-to-end pipeline that
first performs OCR on the handwritten or printed text and then improves its
accuracy using NLP.
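The abstract does not say which NLP tools the pipeline uses, so the following is only a minimal sketch of the general idea of post-OCR correction, not the authors' method. It assumes the OCR step has already produced a noisy string, and it substitutes a simple dictionary lookup with fuzzy matching (Python's standard `difflib`) for the NLP stage. The vocabulary `VOCAB`, the helper names `correct_token` and `post_process`, and the `cutoff` value are all hypothetical choices made for illustration.

```python
import difflib
import re

# Hypothetical toy vocabulary; a real system would use a large lexicon
# or a language model instead.
VOCAB = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]


def correct_token(token, vocab=VOCAB, cutoff=0.6):
    """Return the closest vocabulary word, or the token unchanged if none is close."""
    lower = token.lower()
    if lower in vocab:
        return token
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    matches = difflib.get_close_matches(lower, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def post_process(ocr_text):
    """Correct each word-like run in the OCR output, keeping spacing and punctuation."""
    parts = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", ocr_text)
    return "".join(correct_token(p) if p[0].isalnum() else p for p in parts)


# Simulated OCR output with look-alike character confusions (1/i, vv/w, rn/m, 1/l).
noisy = "the qu1ck brovvn fox jurnps over the 1azy dog"
print(post_process(noisy))
```

The example targets the "similar-looking characters" failure mode the abstract mentions. A corrected word is returned in lower case, and a word missing from the vocabulary is left untouched, so this kind of lookup only helps when the lexicon covers the text being digitized.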
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Data Generation for Post-OCR correction of Cyrillic handwriting [41.94295877935867]
This paper focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves.
Such an engine generates highly realistic handwritten text in any amount, which we utilize to create a substantial dataset.
We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training.
arXiv Detail & Related papers (2023-11-27T15:01:26Z)
- Optimization of Image Processing Algorithms for Character Recognition in
Cultural Typewritten Documents [0.8158530638728501]
This paper evaluates the impact of image processing methods and parameter tuning in Optical Character Recognition (OCR).
The approach uses a multi-objective problem formulation to minimize Levenshtein edit distance and maximize the number of words correctly identified with a non-dominated sorting genetic algorithm (NSGA-II).
Our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results.
arXiv Detail & Related papers (2023-11-27T11:44:46Z)
- Text Detection Forgot About Document OCR [0.0]
This paper compares several methods designed for in-the-wild text recognition and for document text recognition.
The results suggest that state-of-the-art methods originally proposed for in-the-wild text detection also achieve excellent results on document text detection.
arXiv Detail & Related papers (2022-10-14T15:37:54Z)
- To show or not to show: Redacting sensitive text from videos of
electronic displays [4.621328863799446]
We define an approach for redacting personally identifiable text from videos using a combination of optical character recognition (OCR) and natural language processing (NLP) techniques.
We examine the relative performance of this approach when used with different OCR models, specifically Tesseract and the OCR system from Google Cloud Vision (GCV).
arXiv Detail & Related papers (2022-08-19T07:53:04Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained
Models [47.48019831416665]
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
arXiv Detail & Related papers (2021-09-21T16:01:56Z)
- SmartPatch: Improving Handwritten Word Imitation with Patch
Discriminators [67.54204685189255]
We propose SmartPatch, a new technique increasing the performance of current state-of-the-art methods.
We combine the well-known patch loss with information gathered from the handwritten text recognition system trained in parallel.
This leads to an enhanced local discriminator and results in more realistic and higher-quality generated handwritten words.
arXiv Detail & Related papers (2021-05-21T18:34:21Z)
- Unknown-box Approximation to Improve Optical Character Recognition
Performance [7.805544279853116]
A novel approach is presented for creating a customized preprocessor for a given OCR engine.
Experiments with two datasets and two OCR engines show that the presented preprocessor is able to improve OCR accuracy by up to 46% over the baseline.
arXiv Detail & Related papers (2021-05-17T16:09:15Z)
- TextScanner: Reading Characters in Order for Robust Scene Text
Recognition [60.04267660533966]
TextScanner is an alternative approach for scene text recognition.
It generates pixel-wise, multi-channel segmentation maps for character class, position and order.
It also adopts an RNN for context modeling and performs parallel prediction of character position and class.
arXiv Detail & Related papers (2019-12-28T07:52:00Z)