VOLTAGE: A Versatile Contrastive Learning based OCR Methodology for ultra low-resource scripts through Auto Glyph Feature Extraction
- URL: http://arxiv.org/abs/2510.10490v1
- Date: Sun, 12 Oct 2025 07:47:41 GMT
- Title: VOLTAGE: A Versatile Contrastive Learning based OCR Methodology for ultra low-resource scripts through Auto Glyph Feature Extraction
- Authors: Prawaal Sharma, Poonam Goyal, Vidisha Sharma, Navneet Goyal
- Abstract summary: UNESCO has classified 2500 out of 7000 languages spoken worldwide as endangered. Low resource languages are at a greater risk of extinction. Lack of unsupervised Optical Character Recognition (OCR) methodologies for low resource languages is one of the reasons impeding their digital inclusion. We propose VOLTAGE - a contrastive learning based OCR methodology, leveraging auto-glyph feature recommendation for cluster-based labelling.
- Score: 3.03088776072187
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: UNESCO has classified 2500 out of 7000 languages spoken worldwide as endangered. Attrition of a language leads to loss of traditional wisdom, folk literature, and the essence of the community that uses it. It is therefore imperative to bring digital inclusion to these languages and prevent their extinction. Low resource languages are at a greater risk of extinction. Lack of unsupervised Optical Character Recognition (OCR) methodologies for low resource languages is one of the reasons impeding their digital inclusion. We propose VOLTAGE - a contrastive learning based OCR methodology, leveraging auto-glyph feature recommendation for cluster-based labelling. We augment the labelled data for diversity and volume using image transformations and Generative Adversarial Networks. VOLTAGE has been designed using Takri - a family of scripts used from the 16th to the 20th century in the Himalayan regions of India. We present results for Takri along with other Indic scripts (both low and high resource) to substantiate the universal behavior of the methodology. An accuracy of 95% for machine printed and 87% for handwritten samples on Takri script has been achieved. We conduct baseline and ablation studies along with building downstream use cases for Takri, demonstrating the usefulness of our work.
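The abstract names contrastive learning but does not say which objective is used. As a rough illustration only, the sketch below computes the NT-Xent loss, a commonly used contrastive objective, over two augmented views of the same glyph images. The choice of NT-Xent, the function names, and the temperature value are assumptions made for illustration, not details taken from the paper.

```python
import math


def _normalise(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss for N positive pairs (view_a[i], view_b[i]).

    Hypothetical helper: view_a and view_b are lists of embedding vectors
    for two augmentations of the same N glyph images.
    """
    n = len(view_a)
    z = [_normalise(v) for v in list(view_a) + list(view_b)]
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same glyph
        # Similarities to every other embedding (self is excluded).
        sims = [_dot(z[i], z[j]) / temperature for j in range(2 * n) if j != i]
        pos_sim = _dot(z[i], z[positive]) / temperature
        # Numerically stable log-sum-exp over the candidates.
        peak = max(sims)
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in sims))
        total += log_denominator - pos_sim
    return total / (2 * n)
```

The loss is small when each embedding is closer to its own augmented partner than to any other glyph in the batch, which is the property that would make a later clustering step separate glyph classes.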
Related papers
- SynthOCR-Gen: A Synthetic OCR Dataset Generator for Low-Resource Languages - Breaking the Data Barrier [0.0]
We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages. Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets. We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset.
arXiv Detail & Related papers (2026-01-22T17:01:33Z) - Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori [2.1900575893223526]
We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands.
arXiv Detail & Related papers (2025-12-22T18:04:24Z) - Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts [0.0]
This study investigates the potential of Large Language Models (LLMs), particularly GPT-4o, for Optical Character Recognition (OCR) in low-resource scripts such as Urdu, Albanian, and Tajik. Using a meticulously curated dataset of 2,520 images incorporating controlled variations in text length, font size, background color, and blur, the research simulates diverse real-world challenges.
arXiv Detail & Related papers (2024-12-20T18:05:22Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z) - A Benchmark and Dataset for Post-OCR text correction in Sanskrit [23.45279030301887]
Sanskrit is a classical language with about 30 million extant manuscripts fit for digitisation.
We release a post-OCR text correction dataset containing around 218,000 sentences, with 1.5 million words, from 30 different books.
arXiv Detail & Related papers (2022-11-15T08:32:18Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language specific Automatic Speech Recognition is an important yet less explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three different languages: English, Sinhala, and Tamil each with different data sizes to simulate high, medium, and low resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z) - HCR-Net: A deep learning based script independent handwritten character recognition network [5.8067395321424975]
Handwritten character recognition (HCR) remains a challenging pattern recognition problem despite decades of research.
We have proposed a script independent deep learning network for HCR research, called HCR-Net, that sets a new research direction for the field.
arXiv Detail & Related papers (2021-08-15T05:48:07Z) - OCR Post Correction for Endangered Language Texts [113.8242302688894]
We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages.
We present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting.
We develop an OCR post-correction method tailored to ease training in this data-scarce setting.
arXiv Detail & Related papers (2020-11-10T21:21:08Z) - Handwritten Script Identification from Text Lines [38.1188690493442]
We propose a robust method towards identifying scripts from handwritten documents at text line-level.
The recognition is based upon features extracted using Chain Code Histogram (CCH) and Discrete Fourier Transform (DFT).
The proposed method is experimented on 800 handwritten text lines written in seven Indic scripts namely, Gujarati, Kannada, Malayalam, Oriya, Tamil, Telugu, Urdu along with Roman script.
arXiv Detail & Related papers (2020-09-16T02:43:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.