Related papers: Low-Resource Heuristics for Bahnaric Optical Character Recognition Improvement

URL: http://arxiv.org/abs/2601.02965v1
Date: Tue, 06 Jan 2026 12:22:03 GMT
Title: Low-Resource Heuristics for Bahnaric Optical Character Recognition Improvement
Authors: Phat Tran, Phuoc Pham, Hung Trinh, Tho Quan
Abstract summary: Bahnar, a minority language spoken across Vietnam, Cambodia, and Laos, faces significant preservation challenges due to limited research and data availability. This study addresses the critical need for accurate digitization of Bahnar language documents through optical character recognition (OCR) technology. We propose a comprehensive approach combining advanced table and non-table detection techniques with probability-based post-processing heuristics to enhance recognition accuracy.
Score: 3.2537431443459255
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Bahnar, a minority language spoken across Vietnam, Cambodia, and Laos, faces significant preservation challenges due to limited research and data availability. This study addresses the critical need for accurate digitization of Bahnar language documents through optical character recognition (OCR) technology. Digitizing scanned paper documents poses significant challenges, as degraded image quality from broken or blurred areas introduces considerable OCR errors that compromise information retrieval systems. We propose a comprehensive approach combining advanced table and non-table detection techniques with probability-based post-processing heuristics to enhance recognition accuracy. Our method first applies detection algorithms to improve input data quality, then employs probabilistic error correction on OCR output. Experimental results indicate a substantial improvement, with recognition accuracy increasing from 72.86% to 79.26%. This work contributes valuable resources for Bahnar language preservation and provides a framework applicable to other minority language digitization efforts.
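The abstract does not spell out the "probabilistic error correction" step, so the following is only a generic sketch of one common form of probability-based OCR post-correction, not the authors' method. It assumes a hypothetical word-frequency lexicon and scores each candidate by its log prior minus a fixed penalty per character edit. The function names, the toy lexicon, and the parameters `per_edit_log_penalty` and `max_edits` are all illustrative assumptions.

```python
from math import log


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def correct(token: str, lexicon_counts: dict,
            per_edit_log_penalty: float = 4.0, max_edits: int = 2) -> str:
    """Return the lexicon word with the best score, or the token unchanged.

    score = log P(word) - per_edit_log_penalty * edit_distance(token, word)
    The penalty and edit cap are hypothetical tuning knobs, not values
    taken from the paper.
    """
    if token in lexicon_counts:
        return token  # already a known word, leave it alone
    total = sum(lexicon_counts.values())
    best, best_score = token, float("-inf")
    for word, count in lexicon_counts.items():
        distance = edit_distance(token, word)
        if distance > max_edits:
            continue  # too far away to be a plausible OCR slip
        score = log(count / total) - per_edit_log_penalty * distance
        if score > best_score:
            best, best_score = word, score
    return best


# Toy lexicon with made-up counts, for illustration only.
toy_lexicon = {"bahnar": 50, "banner": 5, "language": 30}
corrected = [correct(t, toy_lexicon) for t in ["bahnav", "language", "zzzzzz"]]
```

In this toy run, "bahnav" is one edit away from "bahnar" and is mapped to it, "language" is already in the lexicon, and "zzzzzz" has no candidate within two edits and is left unchanged. A real system would likely replace the uniform edit penalty with character confusion probabilities estimated from aligned OCR output.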

Related papers

Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu [0.0]
Manchu, a critically endangered language, lacks effective OCR systems that can handle real-world historical documents. This study develops high-performing OCR systems by fine-tuning three open-source vision-language models. LLaMA-3.2-11B achieved exceptional performance with 98.3% word accuracy and 0.0024 character error rate on synthetic data.
arXiv Detail & Related papers (2025-07-09T11:38:20Z)
TextSleuth: Towards Explainable Tampered Text Detection [49.88698441048043]
We propose to explain the basis of tampered text detection with natural language via large multimodal models. To fill the data gap for this task, we propose a large-scale, comprehensive dataset, ETTD. Elaborate queries are introduced to generate high-quality anomaly descriptions with GPT4o. To automatically filter out low-quality annotations, we also propose to prompt GPT4o to recognize tampered texts.
arXiv Detail & Related papers (2024-12-19T13:10:03Z)
CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models [0.0]
This paper introduces Context Leveraging OCR Correction (CLOCR-C). It uses the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing the socio-cultural context as part of the correction process.
arXiv Detail & Related papers (2024-08-30T17:26:05Z)
Advancements and Challenges in Arabic Optical Character Recognition: A Comprehensive Survey [0.6629765271909505]
This paper seeks to offer an exhaustive review of contemporary applications, methodologies, and challenges associated with Arabic Optical Character Recognition (OCR). A thorough analysis is conducted on prevailing techniques utilized throughout the OCR process, with a dedicated effort to discern the most efficacious approaches that demonstrate enhanced outcomes. In addition to presenting cutting-edge techniques and methods, this paper critically identifies research gaps within the realm of Arabic OCR.
arXiv Detail & Related papers (2023-12-19T03:01:31Z)
Stable Messenger: Steganography for Message-Concealed Image Generation [6.310429296631073]
We introduce "message accuracy", a novel metric evaluating the entirety of decoded messages for a more holistic evaluation. We propose an adaptive universal loss tailored to enhance message accuracy, named Log-Sum-Exponential (LSE) loss. We also introduce a new latent-aware encoding technique in our framework named Approach, harnessing pretrained Stable Diffusion for advanced steganographic image generation.
arXiv Detail & Related papers (2023-12-03T05:02:43Z)
Leveraging Neural Radiance Fields for Uncertainty-Aware Visual Localization [56.95046107046027]
We propose to leverage Neural Radiance Fields (NeRF) to generate training samples for scene coordinate regression. Despite NeRF's efficiency in rendering, many of the rendered data are polluted by artifacts or only contain minimal information gain.
arXiv Detail & Related papers (2023-10-10T20:11:13Z)
One-stage Low-resolution Text Recognition with High-resolution Knowledge Transfer [53.02254290682613]
Current solutions for low-resolution text recognition typically rely on a two-stage pipeline. We propose an efficient and effective knowledge distillation framework to achieve multi-level knowledge transfer. Experiments show that the proposed one-stage pipeline significantly outperforms super-resolution based two-stage frameworks.
arXiv Detail & Related papers (2023-08-05T02:33:45Z)
User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%. Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
Dynamic Low-Resolution Distillation for Cost-Efficient End-to-End Text Spotting [49.33891486324731]
We propose a novel cost-efficient Dynamic Low-resolution Distillation (DLD) text spotting framework. It aims to infer images in different small but recognizable resolutions and achieve a better balance between accuracy and efficiency. The proposed method can be optimized end-to-end and adopted in any current text spotting framework to improve the practicability.
arXiv Detail & Related papers (2022-07-14T06:49:59Z)
Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages. We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents [2.6201102730518606]
We demonstrate an effective framework for mitigating OCR errors for any downstream NLP task. We first address the data scarcity problem for model training by constructing a document synthesis pipeline. For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
arXiv Detail & Related papers (2021-08-06T00:32:54Z)
Scene Text Image Super-Resolution in the Wild [112.90416737357141]
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones. Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images. We propose a real scene text SR dataset, termed TextZoom. It contains paired real low-resolution and high-resolution images captured by cameras with different focal lengths in the wild.
arXiv Detail & Related papers (2020-05-07T09:18:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.