An End-to-End OCR Framework for Robust Arabic-Handwriting Recognition
using a Novel Transformers-based Model and an Innovative 270 Million-Words
Multi-Font Corpus of Classical Arabic with Diacritics
- URL: http://arxiv.org/abs/2208.11484v1
- Date: Sat, 20 Aug 2022 22:21:19 GMT
- Title: An End-to-End OCR Framework for Robust Arabic-Handwriting Recognition
using a Novel Transformers-based Model and an Innovative 270 Million-Words
Multi-Font Corpus of Classical Arabic with Diacritics
- Authors: Aly Mostafa, Omar Mohamed, Ali Ashraf, Ahmed Elbehery, Salma Jamal,
Anas Salah, Amr S. Ghoneim
- Abstract summary: This research is the second phase in a series of investigations on developing an Optical Character Recognition (OCR) system for Arabic historical documents.
We propose an end-to-end text recognition approach using a Vision Transformer as the encoder, namely BEiT, and a vanilla Transformer as the decoder, eliminating CNNs for feature extraction and reducing the model's complexity.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This research is the second phase in a series of investigations on developing
an Optical Character Recognition (OCR) system for Arabic historical documents and
examining how different modeling procedures interact with the problem. The
first study examined the effect of Transformers on our custom-built Arabic
dataset. One of its downsides was the size of the training data, a mere
15,000 images out of our 30 million, due to a lack of resources. In this phase, we
also add an image-enhancement layer, time and space optimizations, and a
post-correction layer to help the model predict the correct word for the
given context. Notably, we propose an end-to-end text recognition approach
using a Vision Transformer as the encoder, namely BEiT, and a vanilla
Transformer as the decoder, eliminating CNNs for feature extraction and
reducing the model's complexity. The experiments show that our end-to-end
model outperforms convolutional backbones. The model attained a CER of 4.46%.
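For readers who want a concrete picture of the proposed encoder-decoder pairing, the following minimal sketch wires a BEiT image encoder to a vanilla Transformer text decoder with the Hugging Face transformers library and adds a small character error rate (CER) helper. The checkpoint names and decoder choice are illustrative assumptions; this does not reproduce the authors' trained Arabic model.

    # Sketch only: BEiT encoder + generic Transformer decoder, no CNN feature extractor.
    # Checkpoint names below are placeholders, not the authors' released weights.
    from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

    encoder_ckpt = "microsoft/beit-base-patch16-224"   # assumed BEiT encoder
    decoder_ckpt = "bert-base-multilingual-cased"      # stands in for a vanilla Transformer decoder

    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_ckpt, decoder_ckpt)
    tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)
    image_processor = AutoImageProcessor.from_pretrained(encoder_ckpt)

    # Generation-side special tokens must be tied to the tokenizer before training.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id

    def cer(prediction: str, reference: str) -> float:
        # Character error rate = Levenshtein distance / number of reference characters.
        previous = list(range(len(reference) + 1))
        for i, p in enumerate(prediction, start=1):
            current = [i]
            for j, r in enumerate(reference, start=1):
                current.append(min(previous[j] + 1,              # deletion
                                   current[j - 1] + 1,           # insertion
                                   previous[j - 1] + (p != r)))  # substitution
            previous = current
        return previous[-1] / max(len(reference), 1)

    print(cer("كتاب", "كتب"))  # 1 edit over 3 reference characters, roughly 0.33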
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in a source language into an image containing the translation in a target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models.
We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention objective.
The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- Comparative study of Transformer and LSTM Network with attention mechanism on Image Captioning [0.0]
This study compares Transformer and LSTM with attention block model on MS-COCO dataset.
Both the Transformer and the LSTM-with-attention models are discussed along with their state-of-the-art accuracy.
arXiv Detail & Related papers (2023-03-05T11:45:53Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z)
- End-to-End Transformer Based Model for Image Captioning [1.4303104706989949]
The Transformer-based model integrates image captioning into one stage and enables end-to-end training.
The model achieves new state-of-the-art performance of 138.2% (single model) and 141.0% (ensemble of 4 models).
arXiv Detail & Related papers (2022-03-29T08:47:46Z)
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models [47.48019831416665]
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
arXiv Detail & Related papers (2021-09-21T16:01:56Z)
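As a companion to the TrOCR entry above, here is a brief, hedged inference sketch using the publicly released Hugging Face TrOCR checkpoints; the English handwritten checkpoint and the input file name are examples only and are unrelated to the Arabic models in this paper.

    # Minimal TrOCR inference sketch (public English handwritten checkpoint as an example).
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

    image = Image.open("line_image.png").convert("RGB")   # one cropped text-line image
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])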