Related papers: Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines

Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines

URL: http://arxiv.org/abs/2404.06101v1
Date: Tue, 9 Apr 2024 08:08:03 GMT
Title: Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines
Authors: Blnd Yaseen, Hossein Hassani,
Abstract summary: Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan. Current Optical Character Recognition (OCR) systems are unable to extract text from historical documents as they have many issues. In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, that has been used to extract text for various languages.
Score: 1.174020933567308
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan. Having a good Optical Character Recognition (OCR) to help process these publications and contribute to the Kurdish languages resources which is crucial as Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents as they have many issues, including being damaged, very fragile, having many marks left on them, and often written in non-standard fonts and more. This is a massive obstacle in processing these documents as currently processing them requires manual typing which is very time-consuming. In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, that has been used to extract text for various languages. Currently, there is no public dataset, and we developed our own by collecting historical documents from Zheen Center for Documentation and Research, which were printed before 1950 and resulted in a dataset of 1233 images of lines with transcription of each. Then we used the Arabic model as our base model and trained the model using the dataset. We used different methods to evaluate our model, Tesseracts built-in evaluator lstmeval indicated a Character Error Rate (CER) of 0.755%. Additionally, Ocreval demonstrated an average character accuracy of 84.02%. Finally, we developed a web application to provide an easy- to-use interface for end-users, allowing them to interact with the model by inputting an image of a page and extracting the text. Having an extensive dataset is crucial to develop OCR systems with reasonable accuracy, as currently, no public datasets are available for historical Kurdish documents; this posed a significant challenge in our work. Additionally, the unaligned spaces between characters and words proved another challenge with our work.

Related papers

QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation [0.8944616102795021]
We present Qari-OCR, a vision-language models progressively optimized for Arabic.<n>Qari-OCR establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts.
arXiv Detail & Related papers (2025-06-02T22:21:06Z)
SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition [0.995313069446686]
SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents.<n>It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage.<n>Unlike datasets derived from scanned documents, SARD is free from real-world noise and distortions, offering a clean and controlled environment for model training.
arXiv Detail & Related papers (2025-05-30T13:47:54Z)
Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
Reference-Based Post-OCR Processing with LLM for Diacritic Languages [0.0]
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text. This technique generates high-precision pseudo-page-to-page labels for diacritic languages. The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences.
arXiv Detail & Related papers (2024-10-17T08:05:02Z)
Post-OCR Text Correction for Bulgarian Historical Documents [31.072768715994318]
We create the first benchmark dataset for evaluating the OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We then use state-of-the-art LLMs and encoder-decoder framework which we augment with diagonal attention loss and copy and coverage mechanisms to improve the post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state-of-the-art on the ICDAR 2019
arXiv Detail & Related papers (2024-08-31T19:27:46Z)
Ancient but Digitized: Developing Handwritten Optical Character Recognition for East Syriac Script Through Creating KHAMIS Dataset [1.174020933567308]
This paper reports on a research project aimed at developing a optical character recognition (OCR) model based on the handwritten Syriac texts. A dataset was created, KHAMIS, which consists of handwritten sentences in the East Syriac script. The data was collected from volunteers capable of reading and writing in the language to create KHAMIS. The handwritten OCR model was able to achieve a character error rate of 1.097-1.610% and 8.963-10.490% on both training and evaluation sets.
arXiv Detail & Related papers (2024-08-24T17:17:46Z)
LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression. We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols. It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z)
PHD: Pixel-Based Language Modeling of Historical Documents [55.75201940642297]
We propose a novel method for generating synthetic scans to resemble real historical documents. We pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period. We successfully apply our model to a historical QA task, highlighting its usefulness in this domain.
arXiv Detail & Related papers (2023-10-22T08:45:48Z)
EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package. It meets both the computational and sample efficiency requirements for liberating texts at scale. EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%. Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
An empirical study of CTC based models for OCR of Indian languages [31.5002680968116]
Modelling unsegmented sequences using Connectionist Temporal Classification (CTC) is the most commonly used approach for segmentation-free OCR. We present a study of various neural network models that uses CTC for transcribing step-wise predictions in the neural network output to a Unicode sequence. We also introduce a new public dataset called Mozhi for word and line recognition in Indian language.
arXiv Detail & Related papers (2022-05-13T16:19:21Z)
Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages. We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios. Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents. We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
OCR Post Correction for Endangered Language Texts [113.8242302688894]
We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages. We present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting. We develop an OCR post-correction method tailored to ease training in this data-scarce setting.
arXiv Detail & Related papers (2020-11-10T21:21:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.