Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu
- URL: http://arxiv.org/abs/2507.06761v1
- Date: Wed, 09 Jul 2025 11:38:20 GMT
- Title: Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu
- Authors: Yan Hon Michael Chung, Donghyeok Choi
- Abstract summary: Manchu, a critically endangered language, lacks effective OCR systems that can handle real-world historical documents. This study develops high-performing OCR systems by fine-tuning three open-source vision-language models. LLaMA-3.2-11B achieved exceptional performance with 98.3% word accuracy and a 0.0024 character error rate on synthetic data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Manchu, a critically endangered language essential for understanding early modern Eastern Eurasian history, lacks effective OCR systems that can handle real-world historical documents. This study develops high-performing OCR systems by fine-tuning three open-source vision-language models (LLaMA-3.2-11B, Qwen2.5-VL-7B, Qwen2.5-VL-3B) on 60,000 synthetic Manchu word images using parameter-efficient training. LLaMA-3.2-11B achieved exceptional performance with 98.3% word accuracy and a 0.0024 character error rate on synthetic data, while crucially maintaining 93.1% accuracy on real-world handwritten documents. Comparative evaluation reveals substantial advantages over traditional approaches: while a CRNN baseline achieved 99.8% synthetic accuracy, it suffered severe degradation to 72.5% on real documents. Our approach demonstrates effective synthetic-to-real domain transfer, providing a cost-effective solution deployable on accessible infrastructure. This work establishes a transferable framework for endangered language OCR that removes technical and financial barriers in digital humanities, enabling historians and linguists to process historical archives without specialized computing resources. Code and model weights are available at https://github.com/mic7ch1/ManchuAI-OCR.
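As an aside on the metrics reported above: character error rate (CER) is conventionally computed as total edit distance divided by total reference characters, and word accuracy as the exact-match rate over word images. A minimal sketch of these conventional definitions; the function names and sample strings are illustrative, not taken from the paper's released code:

```python
# Sketch of the standard OCR evaluation metrics: CER as Levenshtein edit
# distance over total reference characters, and word accuracy as the
# exact-match rate over word images.

def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(refs: list[str], hyps: list[str]) -> float:
    """Total edit distance divided by total reference characters."""
    edits = sum(levenshtein(h, r) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)

def word_accuracy(refs: list[str], hyps: list[str]) -> float:
    """Fraction of word images transcribed exactly."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)

# Toy romanized-Manchu example: one substitution in one of three words.
refs = ["manju", "gisun", "bithe"]
hyps = ["manju", "gisum", "bithe"]
print(cer(refs, hyps))            # 1 edit over 15 reference characters
print(word_accuracy(refs, hyps))  # 2 of 3 words exact
```

Note that CER penalizes partial errors proportionally, which is why the paper can report a very low CER (0.0024) alongside a word accuracy below 100%.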
Related papers
- Low-Resource Heuristics for Bahnaric Optical Character Recognition Improvement [3.2537431443459255]
Bahnar, a minority language spoken across Vietnam, Cambodia, and Laos, faces significant preservation challenges due to limited research and data availability. This study addresses the critical need for accurate digitization of Bahnar-language documents through optical character recognition (OCR) technology. We propose a comprehensive approach combining advanced table and non-table detection techniques with probability-based post-processing to enhance recognition accuracy.
arXiv Detail & Related papers (2026-01-06T12:22:03Z)
- A Hybrid Architecture for Multi-Stage Claim Document Understanding: Combining Vision-Language Models and Machine Learning for Real-Time Processing [0.0]
Claims documents are fundamental to healthcare and insurance operations, serving as the basis for reimbursement, auditing, and compliance. This paper presents a robust multi-stage pipeline that integrates the multilingual optical character recognition (OCR) engine PaddleOCR, a traditional logistic regression classifier, and a compact Vision-Language Model (VLM), Qwen 2.5-VL-7B. The proposed system achieves a document-type classification accuracy of over 95 percent and a field-level extraction accuracy of approximately 87 percent, while maintaining an average processing latency of under 2 seconds per document.
arXiv Detail & Related papers (2026-01-05T08:40:44Z)
- KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization [57.08591486199925]
This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded and end-to-end (E2E) speech translation systems. Building upon pre-trained models, we fine-tune our systems with different strategies to use resources efficiently.
arXiv Detail & Related papers (2025-05-26T08:38:02Z)
- Harnessing PDF Data for Improving Japanese Large Multimodal Models [56.80385809059738]
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs.
arXiv Detail & Related papers (2025-02-20T17:59:59Z)
- Scrambled text: training Language Models to correct OCR errors using synthetic data [0.0]
This paper shows that fine-tuning a language model on synthetic data can significantly improve the ability to correct OCR errors.
Models trained on synthetic data reduce the character error rate by 55% and the word error rate by 32% over the base LM, and outperform models trained on real data.
arXiv Detail & Related papers (2024-09-29T15:20:37Z)
- CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models [0.0]
This paper introduces Context Leveraging OCR Correction (CLOCR-C), which uses the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing socio-cultural context as part of the correction process.
arXiv Detail & Related papers (2024-08-30T17:26:05Z) - EfficientOCR: An Extensible, Open-Source Package for Efficiently
Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real-low resource setups, on language pairs English - Portuguese, and Tamasheq - French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts [12.346821696831805]
We present a neural spelling correction model, built on Google OCR-ed Tibetan manuscripts, that auto-corrects noisy OCR output.
This paper is divided into four sections: dataset, model architecture, training and analysis.
arXiv Detail & Related papers (2023-04-07T00:45:12Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- OCR Post Correction for Endangered Language Texts [113.8242302688894]
We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages.
We present a systematic analysis showing that general-purpose OCR tools are not robust in this data-scarce setting.
We develop an OCR post-correction method tailored to ease training in this data-scarce setting.
arXiv Detail & Related papers (2020-11-10T21:21:08Z)
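Several of the entries above (e.g., "Scrambled text" and the Tibetan spelling-correction model) train correction models on pairs of clean text and synthetically corrupted text. A minimal sketch of one common way to produce such pairs, assuming a uniform character-level noise model; the noise types, rate, and alphabet are illustrative assumptions, not taken from any of the listed papers:

```python
# Sketch of synthetic OCR-noise generation: corrupt clean text with
# character-level substitutions, deletions, and insertions at a fixed
# rate, splitting the rate evenly across the three error types.
import random

def corrupt(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Inject OCR-like character noise into clean text (deterministic per seed)."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        r = rng.random()
        if r < rate / 3:                  # substitution
            out.append(rng.choice(alphabet))
        elif r < 2 * rate / 3:            # deletion
            continue
        elif r < rate:                    # insertion after the character
            out.append(ch)
            out.append(rng.choice(alphabet))
        else:                             # keep unchanged
            out.append(ch)
    return "".join(out)

clean = "the quick brown fox"
noisy = corrupt(clean)
# Training pairs are (noisy, clean); the correction model learns noisy -> clean.
```

Seeding the generator makes the corruption reproducible, so the same synthetic corpus can be regenerated rather than stored.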
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.