Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images
- URL: http://arxiv.org/abs/2212.05525v3
- Date: Mon, 16 Oct 2023 16:11:58 GMT
- Title: Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images
- Authors: Hongkuan Zhang, Edward Whittaker, Ikuo Kitagishi
- Abstract summary: We propose a localization-free document-level OCR model for transcribing all the characters in a receipt image into an ordered sequence end-to-end.
Specifically, we finetune the pretrained instance-level model TrOCR with randomly cropped image chunks.
In our experiments, the model finetuned with our strategy achieved a 64.4 F1-score and a 22.8% character error rate.
- Score: 0.07673339435080445
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Digitization of scanned receipts aims to extract text from receipt images and
save it into structured documents. This is usually split into two sub-tasks:
text localization and optical character recognition (OCR). Most existing OCR
models focus only on cropped text instance images, which require the bounding
box information provided by a text region detection model. Introducing an
additional detector to identify the text instance images in advance adds
complexity, yet instance-level OCR models have very low accuracy when
processing a whole image for document-level OCR, such as a receipt image
containing multiple text lines arranged in various layouts. To this end, we
propose a localization-free document-level OCR model for transcribing all the
characters in a receipt image into an ordered sequence end-to-end.
Specifically, we finetune the pretrained instance-level model TrOCR with
randomly cropped image chunks, and gradually increase the image chunk size to
generalize the recognition ability from instance images to full-page images. In
our experiments on the SROIE receipt OCR dataset, the model finetuned with our
strategy achieved a 64.4 F1-score and a 22.8% character error rate (CER),
outperforming the baseline results of a 48.5 F1-score and 50.6% CER. The best
model, which splits the full image into 15 equally sized chunks, gives an 87.8
F1-score and 4.98% CER with minimal additional pre- or post-processing of the
output. Moreover, the characters in the generated document-level sequences are
arranged in reading order, which is practical for real-world applications.
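
The chunk-based strategy above lends itself to a short sketch. Below is a minimal, hedged illustration of the idea using the public Hugging Face TrOCR checkpoint: crop random horizontal chunks from a full receipt scan, and grow the chunk height over training stages so the model generalizes from line-level to page-level input. The chunking parameters, the stand-in image, and the label string are placeholders, not the paper's released code or data pipeline.

```python
# Sketch of progressive chunk-size fine-tuning of TrOCR (assumptions noted).
import random
from PIL import Image
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def random_chunk(page: Image.Image, n_lines: int, line_h: int = 40) -> Image.Image:
    """Crop a random horizontal chunk spanning roughly n_lines text lines."""
    h = min(n_lines * line_h, page.height)
    top = random.randint(0, page.height - h)
    return page.crop((0, top, page.width, top + h))

# Curriculum: start with single-line chunks and grow toward the full page.
# One step per stage here is purely illustrative.
page = Image.new("RGB", (600, 800), "white")  # stand-in for a receipt scan
target = "dummy transcription of the chunk"   # stand-in for aligned labels
for n_lines in [1, 2, 4, 8, 16]:
    chunk = random_chunk(page, n_lines)
    pixel_values = processor(images=chunk, return_tensors="pt").pixel_values
    labels = processor.tokenizer(target, return_tensors="pt").input_ids
    loss = model(pixel_values=pixel_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

At inference time, the same model could be applied to the full page, or to the 15 equally sized chunks mentioned above, concatenating the decoded chunks in top-to-bottom order.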
Related papers
- Mero Nagarikta: Advanced Nepali Citizenship Data Extractor with Deep Learning-Powered Text Detection and OCR [0.0]
This work proposes a robust system using YOLOv8 for accurate text object detection and an OCR algorithm based on Optimized PyTesseract.
The system, implemented within a mobile application, allows automated extraction of important textual information.
PyTesseract optimized for Nepali characters outperformed standard OCR in both flexibility and accuracy.
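
A hedged sketch of the detect-then-recognize pipeline this summary describes: YOLOv8 proposes text regions, then Tesseract transcribes each crop. The weights file is a hypothetical fine-tuned detector, not an artifact released by the paper; `nep` is Tesseract's standard Nepali language code.

```python
# Detection-then-OCR pipeline sketch (weights file is an assumption).
from PIL import Image
import pytesseract
from ultralytics import YOLO

detector = YOLO("text_regions_yolov8.pt")  # hypothetical fine-tuned weights

def extract_fields(image_path: str, lang: str = "nep") -> list[str]:
    image = Image.open(image_path)
    texts = []
    for box in detector(image_path)[0].boxes.xyxy.tolist():
        x1, y1, x2, y2 = map(int, box)
        crop = image.crop((x1, y1, x2, y2))
        texts.append(pytesseract.image_to_string(crop, lang=lang).strip())
    return texts
```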
arXiv Detail & Related papers (2024-10-08T06:29:08Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
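
As summarized, the decoder never sees images during pre-training. A toy sketch of that interface follows, with CLIP text-encoder outputs standing in for visual features and a deliberately tiny PyTorch decoder that is not the paper's architecture.

```python
# Pseudo-visual-embedding decoder pre-training sketch (toy decoder).
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["HELLO"], return_tensors="pt", padding=True)
with torch.no_grad():
    pseudo_visual = text_encoder(**tokens).last_hidden_state  # (1, L, 512)

layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
char_embed = nn.Embedding(100, 512)  # toy character vocabulary
lm_head = nn.Linear(512, 100)

# Teacher forcing on toy character ids; at fine-tuning time the pseudo visual
# embeddings would be replaced by real image-encoder features.
tgt = char_embed(torch.tensor([[1, 2, 3]]))
mask = nn.Transformer.generate_square_subsequent_mask(3)
logits = lm_head(decoder(tgt, memory=pseudo_visual, tgt_mask=mask))
```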
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format.
We first craft Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset.
In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing.
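
A rough sketch of screenshot-as-document retrieval in this spirit, using off-the-shelf CLIP as a stand-in bi-encoder (DSE trains its own vision-language model; the checkpoint and toy corpus below are not theirs):

```python
# Screenshot retrieval sketch with CLIP as a stand-in bi-encoder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

screenshots = [Image.new("RGB", (224, 224), "white")]  # corpus placeholders
query = "who wrote the iliad"

with torch.no_grad():
    doc_emb = model.get_image_features(
        **processor(images=screenshots, return_tensors="pt"))
    q_emb = model.get_text_features(
        **processor(text=[query], return_tensors="pt", padding=True))

# Rank screenshots by cosine similarity to the query embedding.
scores = torch.nn.functional.cosine_similarity(q_emb, doc_emb)
best = int(scores.argmax())
```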
arXiv Detail & Related papers (2024-06-17T06:27:35Z)
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts seriously degrade the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR.
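
A hedged sketch of what LoRA-based parameter-efficient fine-tuning of TrOCR can look like with the `peft` library; the rank, alpha, and target module names are illustrative guesses rather than the paper's reported configuration.

```python
# LoRA fine-tuning sketch for TrOCR (hyperparameters are assumptions).
from peft import LoraConfig, get_peft_model
from transformers import VisionEncoderDecoderModel

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    # Attention projections in the TrOCR decoder (assumed target names).
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```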
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
- LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
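
The core interface implied by this summary is a decoder that emits a token and its location at every autoregressive step. A toy PyTorch head illustrating that joint prediction (not LOCR's actual architecture):

```python
# Joint token-and-bounding-box prediction head (illustrative only).
import torch
import torch.nn as nn

class TokenAndBoxHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.token_head = nn.Linear(d_model, vocab_size)
        self.box_head = nn.Linear(d_model, 4)  # (x1, y1, x2, y2), normalized

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq, d_model) from a transformer decoder
        return self.token_head(hidden), torch.sigmoid(self.box_head(hidden))

head = TokenAndBoxHead(d_model=768, vocab_size=50000)
logits, boxes = head(torch.randn(1, 10, 768))
```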
arXiv Detail & Related papers (2024-03-04T15:34:12Z)
- Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents [0.8158530638728501]
This paper evaluates the impact of image processing methods and parameter tuning on optical character recognition (OCR).
The approach uses a multi-objective problem formulation to minimize the Levenshtein edit distance and maximize the number of correctly identified words, solved with the non-dominated sorting genetic algorithm II (NSGA-II).
Our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results.
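
A minimal sketch of this bi-objective setup with pymoo's NSGA-II follows; the `run_ocr` stub and the two tuned parameters (binarization threshold, blur strength) are hypothetical placeholders for a real pre-processing and OCR pipeline.

```python
# Bi-objective OCR parameter tuning sketch with NSGA-II (pymoo).
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import ElementwiseProblem
from pymoo.optimize import minimize

REFERENCE = "TOTAL 12.50"

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def run_ocr(threshold: float, blur: float) -> str:
    # Placeholder: a real implementation would pre-process the image with
    # these parameters and run an OCR engine on the result.
    return "TOTAL 12.50" if threshold > 0.5 else "T0TAL 12,S0"

class OCRTuning(ElementwiseProblem):
    def __init__(self):
        super().__init__(n_var=2, n_obj=2, xl=[0.0, 0.0], xu=[1.0, 5.0])

    def _evaluate(self, x, out, *args, **kwargs):
        hyp = run_ocr(x[0], x[1])
        edit = levenshtein(hyp, REFERENCE)
        correct = sum(h == r for h, r in zip(hyp.split(), REFERENCE.split()))
        out["F"] = [edit, -correct]  # minimize distance, maximize correct words

res = minimize(OCRTuning(), NSGA2(pop_size=20), ("n_gen", 10), verbose=False)
```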
arXiv Detail & Related papers (2023-11-27T11:44:46Z)
- EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
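
Since recognition in this framing reduces to classifying each character crop by visual appearance rather than modeling language, a toy sketch of the idea is a plain image classifier over character crops (illustrative only, not the EffOCR package's API):

```python
# Character recognition as appearance-only classification (toy sketch).
import torch
import torch.nn as nn

char_classifier = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, 95),  # printable-ASCII classes, illustrative
)
logits = char_classifier(torch.randn(8, 1, 32, 32))  # 8 character crops
```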
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
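
A generic sketch of that shared-backbone, multi-branch layout in PyTorch; the encoder and head shapes are illustrative, not TextFormer's actual modules. Joint training would simply sum the three branch losses over the shared features.

```python
# Shared encoder with classification, segmentation, recognition heads (toy).
import torch
import torch.nn as nn

class MultiTaskSpotter(nn.Module):
    def __init__(self, d_model: int = 256, vocab_size: int = 97):
        super().__init__()
        self.encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.cls_head = nn.Linear(d_model, 2)           # text / no-text
        self.seg_head = nn.Conv2d(d_model, 1, 1)        # text-region mask
        self.rec_head = nn.Linear(d_model, vocab_size)  # per-token characters

    def forward(self, images: torch.Tensor):
        feat = self.encoder(images)               # (B, C, H/16, W/16)
        tokens = feat.flatten(2).transpose(1, 2)  # (B, N, C)
        return self.cls_head(tokens), self.seg_head(feat), self.rec_head(tokens)

out = MultiTaskSpotter()(torch.randn(1, 3, 256, 256))
```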
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models [47.48019831416665]
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
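
For reference, the released TrOCR checkpoints follow the standard Hugging Face vision-encoder-decoder usage on a single cropped text line (the instance-level setting the main abstract above contrasts with full-page input); a stand-in blank image replaces a real crop here:

```python
# Standard TrOCR inference on one text-line crop.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

line = Image.new("RGB", (384, 64), "white")  # stand-in for a text-line crop
pixel_values = processor(images=line, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```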
arXiv Detail & Related papers (2021-09-21T16:01:56Z)
- TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text [23.04601165885908]
We propose TextOCR, an arbitrary-shaped scene text detection and recognition dataset with 900k annotated words collected on real images.
We show that current state-of-the-art text-recognition (OCR) models fail to perform well on TextOCR.
We use a TextOCR-trained OCR model to create the PixelM4C model, which can perform scene-text-based reasoning on an image in an end-to-end fashion.
arXiv Detail & Related papers (2021-05-12T07:50:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.