RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages
- URL: http://arxiv.org/abs/2412.15248v1
- Date: Sat, 14 Dec 2024 19:59:41 GMT
- Title: RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages
- Authors: Harshvivek Kashid, Pushpak Bhattacharyya
- Abstract summary: We propose an approach for synthetic data generation for Devanagari languages, RoundTripOCR.
We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit.
We also present a novel approach for OCR error correction by leveraging techniques from machine translation.
- Abstract: Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Just like Machine Translation systems, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose an approach for synthetic data generation for Devanagari languages, RoundTripOCR, that tackles the scarcity of the post-OCR Error Correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction by leveraging techniques from machine translation. Our method involves translating erroneous OCR output into a corrected form by treating the OCR errors as mistranslations in a parallel text corpus, employing pre-trained transformer models to learn the mapping from erroneous to correct text pairs, effectively correcting OCR errors.
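The idea in the abstract, pairing erroneous OCR output with its clean text and treating correction as translation from the former to the latter, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the abstract does not describe how RoundTripOCR produces its noisy text, so `simulate_ocr` is a hypothetical stand-in that swaps visually similar characters, and the confusion table, `make_pairs`, and `cer` are names assumed here. `cer` counts Unicode code points, which is a simplification for Devanagari, where one visible character can span several code points.

```python
import random


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def simulate_ocr(text, confusions, rate, rng):
    """Hypothetical stand-in for a real OCR pass over rendered text.

    Each character listed in `confusions` is replaced by its look-alike
    with probability `rate`; everything else passes through unchanged.
    """
    out = []
    for ch in text:
        if ch in confusions and rng.random() < rate:
            out.append(confusions[ch])
        else:
            out.append(ch)
    return "".join(out)


def make_pairs(sentences, confusions, rate=0.3, seed=0):
    """Build (erroneous, correct) pairs, i.e. (source, target) for a
    translation-style correction model."""
    rng = random.Random(seed)
    return [(simulate_ocr(s, confusions, rate, rng), s) for s in sentences]


# Illustrative look-alike pairs chosen for this sketch, not taken from the paper.
confusions = {"ब": "व", "म": "भ"}
pairs = make_pairs(["बम", "नमस्ते"], confusions, rate=0.3)
```

Each pair has the noisy string as the source side and the clean string as the target side, which is the shape a sequence-to-sequence model would be fine-tuned on. Calling `cer(clean, noisy)` on a pair gives the amount of noise injected, and the same function can score a model's corrected output against the clean text.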
Related papers
- JaPOC: Japanese Post-OCR Correction Benchmark using Vouchers [0.0]
We create benchmarks and assess the effectiveness of error correction methods for Japanese vouchers in OCR (Optical Character Recognition) systems.
In the experiments, the proposed error correction algorithm significantly improved overall recognition accuracy.
(2024-09-30T05:01:49Z)
- Scrambled text: training Language Models to correct OCR errors using synthetic data [0.0]
This paper shows that fine-tuning a language model on synthetic data can significantly improve the ability to correct OCR errors.
Models trained on synthetic data reduce the character error rate by 55% and word error rate by 32% over the base LM and outperform models trained on real data.
(2024-09-29T15:20:37Z)
- A Coin Has Two Sides: A Novel Detector-Corrector Framework for Chinese Spelling Correction [79.52464132360618]
Chinese Spelling Correction (CSC) stands as a foundational Natural Language Processing (NLP) task.
We introduce a novel approach based on an error detector-corrector framework.
Our detector is designed to yield two error detection results, each characterized by high precision and recall.
(2024-09-06T09:26:45Z)
- CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models [0.0]
This paper introduces Context Leveraging OCR Correction (CLOCR-C).
It uses the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality.
The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing socio-cultural context as part of the correction process.
(2024-08-30T17:26:05Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
(2023-02-26T21:41:15Z)
- OCR Improves Machine Translation for Low-Resource Languages [10.010595434359647]
We introduce and make publicly available a novel benchmark, OCR4MT, consisting of real and synthetic data, enriched with noise.
We evaluate state-of-the-art OCR systems on our benchmark and analyse the most common errors.
We then perform an ablation study to investigate how OCR errors impact Machine Translation performance.
(2022-02-27T02:36:45Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
(2021-11-04T04:39:02Z)
- Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR).
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
(2021-10-08T05:07:35Z)
- Neural OCR Post-Hoc Correction of Historical Corpora [4.427447378048202]
We propose a neural approach based on a combination of recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors.
We show that our models are robust in capturing diverse OCR transcription errors and reduce a word error rate of 32.3% by more than 89%.
(2021-02-01T01:35:55Z)
- OCR Post Correction for Endangered Language Texts [113.8242302688894]
We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages.
We present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting.
We develop an OCR post-correction method tailored to ease training in this data-scarce setting.
(2020-11-10T21:21:08Z)