An empirical study of CTC based models for OCR of Indian languages
- URL: http://arxiv.org/abs/2205.06740v1
- Date: Fri, 13 May 2022 16:19:21 GMT
- Title: An empirical study of CTC based models for OCR of Indian languages
- Authors: Minesh Mathew and CV Jawahar
- Abstract summary: Modelling unsegmented sequences using Connectionist Temporal Classification (CTC) is the most commonly used approach for segmentation-free OCR.
We present a study of various neural network models that use CTC for transcribing step-wise predictions in the neural network output to a Unicode sequence.
We also introduce a new public dataset called Mozhi for word and line recognition in Indian languages.
- Score: 31.5002680968116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognition of text on word or line images, without the need for sub-word
segmentation has become the mainstream of research and development of text
recognition for Indian languages. Modelling unsegmented sequences using
Connectionist Temporal Classification (CTC) is the most commonly used approach
for segmentation-free OCR. In this work we present a comprehensive empirical
study of various neural network models that use CTC for transcribing step-wise
predictions in the neural network output to a Unicode sequence. The study is
conducted for 13 Indian languages, using an internal dataset that has around
1000 pages per language. We study the choice of line vs word as the recognition
unit, and use of synthetic data to train the models. We compare our models with
popular publicly available OCR tools for end-to-end document image recognition.
Our end-to-end pipeline, which employs our recognition models and existing text
segmentation tools, outperforms these public OCR tools for 8 out of the 13
languages. We also introduce a new public dataset called Mozhi for word and
line recognition in Indian languages. The dataset contains more than 1.2 million
annotated word images (120 thousand text lines) across 13 Indian languages. Our
code, trained models and the Mozhi dataset will be made available at
http://cvit.iiit.ac.in/research/projects/cvit-projects/
Related papers
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset has 15 times larger scales while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines [1.174020933567308]
Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan.
Current Optical Character Recognition (OCR) systems are unable to extract text from historical documents, as such documents suffer from many issues.
In this study, we adopt Tesseract version 5.0, an open-source OCR framework by Google that has been used to extract text for various languages.
arXiv Detail & Related papers (2024-04-09T08:08:03Z)
- IndicSTR12: A Dataset for Indic Scene Text Recognition [33.194567434881314]
This paper proposes the largest and most comprehensive real dataset to date - IndicSTR12 - and benchmarks STR performance on 12 major Indian languages.
The size and complexity of the proposed dataset are comparable to those of existing Latin contemporaries.
The dataset contains over 27000 word-images gathered from various natural scenes, with over 1000 word-images for each language.
arXiv Detail & Related papers (2024-03-12T18:14:48Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- A Benchmark and Dataset for Post-OCR text correction in Sanskrit [23.45279030301887]
Sanskrit is a classical language with about 30 million extant manuscripts fit for digitisation.
We release a post-OCR text correction dataset containing around 218,000 sentences, with 1.5 million words, from 30 different books.
arXiv Detail & Related papers (2022-11-15T08:32:18Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning-Based Approaches [0.0]
This work presents conditional random field (CRF) and deep learning-based approaches (CNN and Bi-LSTM) to develop an Odia part-of-speech tagger.
It has been observed that the Bi-LSTM model with character sequence features and pre-trained word vectors achieved a significant, state-of-the-art result.
arXiv Detail & Related papers (2022-07-07T12:15:23Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource language to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- End-to-End Optical Character Recognition for Bengali Handwritten Words [0.0]
This paper introduces an end-to-end OCR system for Bengali language.
The proposed architecture implements an end-to-end strategy that recognises handwritten Bengali words from word images.
arXiv Detail & Related papers (2021-05-09T20:48:56Z)
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.