bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents
- URL: http://arxiv.org/abs/2308.10647v2
- Date: Tue, 22 Aug 2023 02:32:01 GMT
- Title: bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents
- Authors: Imam Mohammad Zulkarnain, Shayekh Bin Islam, Md. Zami Al Zunaed
Farabe, Md. Mehedi Hasan Shawon, Jawaril Munshad Abedin, Beig Rajibul Hasan,
Marsia Haque, Istiak Shihab, Syed Mobassir, MD. Nazmuddoha Ansary, Asif
Sushmit, Farig Sadeque
- Abstract summary: We introduce Bengali.AI-BRACU-OCR (bbOCR), an open-source scalable document OCR system that can reconstruct Bengali documents into a structured searchable digitized format.
Our proposed solution is preferable to the current state-of-the-art Bengali OCR systems.
- Score: 0.23639235997306196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the existence of numerous Optical Character Recognition (OCR) tools,
the lack of comprehensive open-source systems hampers the progress of document
digitization in various low-resource languages, including Bengali. Low-resource
languages, especially those with an alphasyllabary writing system, suffer from
the lack of large-scale datasets for various document OCR components such as
word-level OCR, document layout extraction, and distortion correction, which
are available as individual modules in high-resource languages. In this paper,
we introduce Bengali.AI-BRACU-OCR (bbOCR): an open-source scalable document
OCR system that can reconstruct Bengali documents into a structured searchable
digitized format, leveraging a novel Bengali text recognition model and two
novel synthetic datasets. We present extensive component-level and system-level
evaluations, both of which use a novel diversified evaluation dataset and
comprehensive evaluation metrics. Our extensive evaluation suggests that our proposed
solution is preferable to the current state-of-the-art Bengali OCR systems.
The source code and datasets are available here:
https://bengaliai.github.io/bbocr.
Related papers
- CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset [12.828786692835369]
This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding dataset (CORU).
CORU consists of over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores.
We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods.
arXiv Detail & Related papers (2024-06-06T20:38:15Z)
- Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines [1.174020933567308]
Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan.
Current Optical Character Recognition (OCR) systems are unable to extract text from historical documents as they have many issues.
In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, that has been used to extract text for various languages.
arXiv Detail & Related papers (2024-04-09T08:08:03Z)
- LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z)
- EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as a web page.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- End-to-End Optical Character Recognition for Bengali Handwritten Words [0.0]
This paper introduces an end-to-end OCR system for the Bengali language.
The proposed architecture implements an end-to-end strategy that recognises Bengali words from handwritten word images.
arXiv Detail & Related papers (2021-05-09T20:48:56Z)
- OCR Post Correction for Endangered Language Texts [113.8242302688894]
We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages.
We present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting.
We develop an OCR post-correction method tailored to ease training in this data-scarce setting.
arXiv Detail & Related papers (2020-11-10T21:21:08Z)
- An Efficient Language-Independent Multi-Font OCR for Arabic Script [0.0]
This paper proposes a complete Arabic OCR system that takes a scanned image of Arabic Naskh script as an input and generates a corresponding digital document.
This paper also proposes an improved font-independent character segmentation algorithm that outperforms the state-of-the-art segmentation algorithms.
arXiv Detail & Related papers (2020-09-18T22:57:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.