Post-OCR Text Correction for Bulgarian Historical Documents
- URL: http://arxiv.org/abs/2409.00527v1
- Date: Sat, 31 Aug 2024 19:27:46 GMT
- Title: Post-OCR Text Correction for Bulgarian Historical Documents
- Authors: Angel Beshirov, Milena Dobreva, Dimitar Dimitrov, Momchil Hardalov, Ivan Koychev, Preslav Nakov
- Abstract summary: We create the first benchmark dataset for evaluating the OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century.
We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with a diagonal attention loss and copy and coverage mechanisms to improve post-OCR text correction.
The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state of the art on the ICDAR 2019 Bulgarian dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The digitization of historical documents is crucial for preserving the cultural heritage of society. An important step in this process is converting scanned images to text using Optical Character Recognition (OCR), which can enable further search, information extraction, etc. Unfortunately, this is a hard problem, as standard OCR tools are not tailored to deal with historical orthography or with challenging layouts. Thus, it is standard to apply an additional text correction step on the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literary texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with a diagonal attention loss and copy and coverage mechanisms to improve post-OCR text correction. The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, which is an increase of 16% compared to the state of the art on the ICDAR 2019 Bulgarian dataset. We release our data and code at https://github.com/angelbeshirov/post-ocr-text-correction.
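The diagonal attention loss mentioned in the abstract exploits the near-monotonic alignment between OCR output and its corrected text: the decoder should mostly attend near the diagonal of the attention matrix. The paper's exact formulation is not reproduced here, so the following is only an illustrative sketch under that assumption; the function name, the `band` parameter, and the linear-alignment heuristic are our own choices, not the authors' implementation.

```python
import numpy as np

def diagonal_attention_loss(attn, band=2):
    """Penalize attention mass falling far from the diagonal.

    attn: array of shape (batch, tgt_len, src_len) with decoder
    attention weights (each target step's weights sum to 1).
    In post-OCR correction, source and corrected text are nearly
    character-aligned, so attention is expected to stay close to
    a linear source-target alignment.
    """
    batch, tgt_len, src_len = attn.shape
    tgt_pos = np.arange(tgt_len)[:, None]          # (tgt_len, 1)
    src_pos = np.arange(src_len)[None, :]          # (1, src_len)
    # Expected source position for each target step (linear alignment).
    diag = tgt_pos * (src_len - 1) / max(tgt_len - 1, 1)
    # Mask selecting positions more than `band` steps off the diagonal.
    off_diag = (np.abs(src_pos - diag) > band).astype(attn.dtype)
    # Average off-diagonal attention mass per target step.
    return float((attn * off_diag).sum(axis=-1).mean())
```

In training, such a term would be added to the cross-entropy objective with a weighting coefficient, e.g. `loss = ce_loss + lam * diagonal_attention_loss(attn)`; perfectly diagonal attention incurs zero penalty, while uniform attention is penalized in proportion to how much mass lies outside the band.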
Related papers
- Reference-Based Post-OCR Processing with LLM for Diacritic Languages
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text.
This technique generates high-precision pseudo-page-to-page labels for diacritic languages.
The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences.
arXiv Detail & Related papers (2024-10-17T08:05:02Z)
- Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines
Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan.
Current Optical Character Recognition (OCR) systems are unable to extract text from historical documents as they have many issues.
In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, that has been used to extract text for various languages.
arXiv Detail & Related papers (2024-04-09T08:08:03Z)
- Data Generation for Post-OCR correction of Cyrillic handwriting
This paper focuses on the development and application of a synthetic handwriting generation engine based on Bézier curves.
Such an engine generates highly realistic handwritten text in any amount, which we utilize to create a substantial dataset.
We apply a Handwritten Text Recognition (HTR) model to this dataset to identify OCR errors, forming the basis for our POC model training.
arXiv Detail & Related papers (2023-11-27T15:01:26Z)
- PHD: Pixel-Based Language Modeling of Historical Documents
We propose a novel method for generating synthetic scans to resemble real historical documents.
We pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period.
We successfully apply our model to a historical QA task, highlighting its usefulness in this domain.
arXiv Detail & Related papers (2023-10-22T08:45:48Z)
- Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts
We present a neural spelling correction model built on Google OCR-ed Tibetan Manuscripts to auto-correct OCR-ed noisy output.
This paper is divided into four sections: dataset, model architecture, training and analysis.
arXiv Detail & Related papers (2023-04-07T00:45:12Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Neural OCR Post-Hoc Correction of Historical Corpora
We propose a neural approach based on a combination of recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors.
We show that our models are robust in capturing diverse OCR transcription errors and reduce a word error rate of 32.3% by more than 89%.
arXiv Detail & Related papers (2021-02-01T01:35:55Z)
- OCR Post Correction for Endangered Language Texts
We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages.
We present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting.
We develop an OCR post-correction method tailored to ease training in this data-scarce setting.
arXiv Detail & Related papers (2020-11-10T21:21:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.