IDPL-PFOD2: A New Large-Scale Dataset for Printed Farsi Optical
Character Recognition
- URL: http://arxiv.org/abs/2312.01177v1
- Date: Sat, 2 Dec 2023 16:56:57 GMT
- Title: IDPL-PFOD2: A New Large-Scale Dataset for Printed Farsi Optical
Character Recognition
- Authors: Fatemeh Asadi-zeydabadi, Ali Afkari-Fahandari, Amin Faraji, Elham
Shabaninia, Hossein Nezamabadi-pour
- Abstract summary: This paper presents a novel large-scale dataset, IDPL-PFOD2, tailored for Farsi printed text recognition.
The dataset comprises 2,003,541 images featuring a wide variety of fonts, styles, and sizes.
- Score: 6.780778335996319
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optical Character Recognition is a technique that converts document images
into searchable and editable text, making it a valuable tool for processing
scanned documents. While the Farsi language stands as a prominent and official
language in Asia, efforts to develop efficient methods for recognizing Farsi
printed text have been relatively limited. This is primarily attributed to
the language's distinctive features, such as its cursive form, the
resemblance between certain alphabet characters, and the presence of
numerous diacritics and varying dot placements. Moreover, given the
substantial training sample requirements of deep learning-based
architectures for effective performance, the development of such datasets
holds paramount significance. In light of these
concerns, this paper aims to present a novel large-scale dataset, IDPL-PFOD2,
tailored for Farsi printed text recognition. The dataset comprises
2,003,541 images featuring a wide variety of fonts, styles, and sizes. This
dataset is an
extension of the previously introduced IDPL-PFOD dataset, offering a
substantial increase in both volume and diversity. Furthermore, the
dataset's effectiveness is assessed through the utilization of both
CRNN-based and Vision
Transformer architectures. The CRNN-based model achieves a baseline accuracy
rate of 78.49% and a normalized edit distance of 97.72%, while the Vision
Transformer architecture attains an accuracy of 81.32% and a normalized edit
distance of 98.74%.
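To make the two reported metrics concrete, here is a minimal sketch of how sequence accuracy and normalized edit distance are commonly computed for OCR output. It illustrates the metric definitions only; it is an assumption about the evaluation protocol, not the authors' code.

```python
# Minimal sketch of the two reported metrics: sequence accuracy and
# normalized edit distance (NED), here reported as a similarity in [0, 1].
# Illustrative only -- not the paper's evaluation code.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def evaluate(preds: list[str], refs: list[str]) -> tuple[float, float]:
    """Return (sequence accuracy, mean normalized edit similarity)."""
    correct = sum(p == r for p, r in zip(preds, refs))
    # NED as similarity: 1 - dist / max(len). The related quantity
    # dist / len(ref) is the character error rate (CER) used below for Urdu.
    ned = sum(1 - levenshtein(p, r) / max(len(p), len(r), 1)
              for p, r in zip(preds, refs))
    n = len(refs)
    return correct / n, ned / n

# Hypothetical example with two Farsi transcriptions:
acc, ned = evaluate(["سلام دنیا", "کتاب"], ["سلام دنیا", "کتابی"])
print(f"accuracy={acc:.2%}, NED={ned:.2%}")  # accuracy=50.00%, NED=90.00%
```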
Related papers
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- A Permuted Autoregressive Approach to Word-Level Recognition for Urdu Digital Text [2.2012643583422347]
This research paper introduces a novel word-level Optical Character Recognition (OCR) model specifically designed for digital Urdu text.
The model employs a permuted autoregressive sequence (PARSeq) architecture, which enhances its performance.
The model demonstrates a high level of accuracy in capturing the intricacies of Urdu script, achieving a CER of 0.178.
arXiv Detail & Related papers (2024-08-27T14:58:13Z)
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the Image Address Localization (IAL) problem with richer semantics.
We have built three datasets from Pittsburgh and San Francisco at different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643]
There is a significant research gap for low-resource languages, especially Swahili.
Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition.
We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
arXiv Detail & Related papers (2024-05-19T03:55:02Z)
- Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models.
We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention objective.
The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- Orientation-Independent Chinese Text Recognition in Scene Images [61.34060587461462]
We make the first attempt to extract orientation-independent visual features by disentangling content and orientation information of text images.
Specifically, we introduce a Character Image Reconstruction Network (CIRN) to recover corresponding printed character images with disentangled content and orientation information.
arXiv Detail & Related papers (2023-09-03T05:30:21Z)
- Exploring Better Text Image Translation with Multimodal Codebook [39.12169843196739]
Text image translation (TIT) aims to translate the source texts embedded in the image to target translations.
In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K, providing convenience for subsequent studies.
Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts.
We present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts.
arXiv Detail & Related papers (2023-05-27T08:41:18Z)
- Towards Boosting the Accuracy of Non-Latin Scene Text Recognition [27.609596088151644]
Scene-text recognition is remarkably better in Latin languages than in non-Latin languages.
This paper examines the possible reasons for low accuracy by comparing English datasets with non-Latin languages.
arXiv Detail & Related papers (2022-01-10T06:36:43Z)
- Line Segmentation from Unconstrained Handwritten Text Images using Adaptive Approach [10.436029791699777]
Line segmentation from handwritten text images is a challenging task due to the diversity of handwriting and its unpredictable variations.
An adaptive approach is used for line segmentation from handwritten text images, merging the alignment of connected-component coordinates with text height.
The proposed scheme is tested on two different types of datasets: document pages with baselines and plain pages (a simplified segmentation sketch follows this entry).
arXiv Detail & Related papers (2021-04-18T08:52:52Z)
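For intuition about line segmentation, here is a highly simplified sketch based on a horizontal projection profile. The paper's adaptive connected-component method is more robust than this; the function name and the `min_gap` parameter are illustrative assumptions.

```python
import numpy as np

def segment_lines(binary_img: np.ndarray, min_gap: int = 3) -> list[tuple[int, int]]:
    """Split a binarized page (text pixels = 1) into (top, bottom) row bands.

    A simplified horizontal-projection heuristic: a run of at least
    `min_gap` ink-free rows separates two text lines. Real handwritten
    pages need the adaptive, connected-component-based treatment described
    in the paper above; this only illustrates the basic idea.
    """
    profile = binary_img.sum(axis=1)           # ink pixels per row
    lines, start, gap = [], None, 0
    for y, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = y                      # a new text band begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:                 # band ended `gap` rows ago
                lines.append((start, y - gap + 1))
                start = None
    if start is not None:                      # page ends inside a band
        lines.append((start, len(profile)))
    return lines
```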
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length (see the CTC decoding sketch after this entry).
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
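To illustrate why a CTC decoder places no constraint on input length (as noted in the entry above): the model emits one label distribution per input frame, and decoding collapses that frame sequence, so no fixed output length is baked in. A minimal greedy-decoding sketch follows; the blank index, alphabet, and frame scores are all made up for illustration.

```python
import numpy as np

# CTC greedy decoding: collapse repeated labels, then drop the blank symbol.
BLANK = 0
ALPHABET = {1: "a", 2: "b", 3: "c"}  # made-up label set for illustration

def ctc_greedy_decode(logits: np.ndarray) -> str:
    """logits: (num_frames, num_labels) array of per-frame scores."""
    best = logits.argmax(axis=1)               # best label per frame
    out, prev = [], BLANK
    for label in best:
        if label != prev and label != BLANK:   # collapse repeats, skip blanks
            out.append(ALPHABET[int(label)])
        prev = label
    return "".join(out)

# Hypothetical per-frame scores encoding the sequence "ab":
frames = np.array([[0.1, 0.8, 0.05, 0.05],   # 'a'
                   [0.1, 0.8, 0.05, 0.05],   # 'a' repeated -> collapsed
                   [0.9, 0.03, 0.03, 0.04],  # blank separates symbols
                   [0.1, 0.05, 0.8, 0.05]])  # 'b'
print(ctc_greedy_decode(frames))  # -> "ab"
```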
- TLGAN: document Text Localization using Generative Adversarial Nets [2.1378501793514277]
Text localization from a digital image is the first step in optical character recognition.
Deep neural networks are used to perform text localization from digital images.
TLGAN is trained on only ten labeled receipt images from the Robust Reading Challenge on Scanned Receipts OCR and Information Extraction (SROIE).
TLGAN achieved 99.83% precision and 99.64% recall on the SROIE test data (an IoU-based precision/recall sketch follows this entry).
arXiv Detail & Related papers (2020-10-22T09:19:13Z)
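The precision and recall reported above are detection metrics. The sketch below shows one common way to compute them via greedy IoU matching of predicted boxes to ground-truth boxes; the 0.5 IoU threshold and the matching protocol are assumptions, not necessarily what the TLGAN evaluation used.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def precision_recall(preds, gts, thresh=0.5):
    """Greedy one-to-one matching of predictions to ground truth."""
    matched, tp = set(), 0
    for p in preds:
        best_j, best_iou = -1, thresh
        for j, g in enumerate(gts):
            if j not in matched and iou(p, g) >= best_iou:
                best_j, best_iou = j, iou(p, g)
        if best_j >= 0:                 # count a true positive at most once
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall

# Hypothetical boxes: one prediction matches, one ground truth is missed.
preds = [(10, 10, 50, 30), (60, 10, 100, 30)]
gts = [(12, 11, 52, 31), (200, 200, 240, 220)]
print(precision_recall(preds, gts))  # -> (0.5, 0.5)
```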