A Benchmark and Dataset for Post-OCR text correction in Sanskrit
- URL: http://arxiv.org/abs/2211.07980v1
- Date: Tue, 15 Nov 2022 08:32:18 GMT
- Title: A Benchmark and Dataset for Post-OCR text correction in Sanskrit
- Authors: Ayush Maheshwari, Nikhil Singh, Amrith Krishna, Ganesh Ramakrishnan
- Abstract summary: Sanskrit is a classical language with about 30 million extant manuscripts fit for digitisation.
We release a post-OCR text correction dataset containing around 218,000 sentences, with 1.5 million words, from 30 different books.
- Score: 23.45279030301887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sanskrit is a classical language with about 30 million extant manuscripts fit
for digitisation, available in written, printed or scanned-image forms. However,
it is still considered to be a low-resource language when it comes to available
digital resources. In this work, we release a post-OCR text correction dataset
containing around 218,000 sentences, with 1.5 million words, from 30 different
books. Texts in Sanskrit are known to be diverse in terms of their linguistic
and stylistic usage since Sanskrit was the 'lingua franca' for discourse in the
Indian subcontinent for about 3 millennia. Keeping this in mind, we release a
multi-domain dataset, from areas as diverse as astronomy, medicine and
mathematics, with some of them as old as 18 centuries. Further, we release
multiple strong baselines as benchmarks for the task, based on pre-trained
Seq2Seq language models. We find that our best-performing model, consisting of
byte-level tokenization in conjunction with phonetic encoding (ByT5+SLP1),
yields a 23 percentage point improvement over the OCR output in terms of word and character
error rates. Moreover, we perform extensive experiments in evaluating these
models on their performance and analyse common causes of mispredictions both at
the graphemic and lexical levels. Our code and dataset are publicly available at
https://github.com/ayushbits/pe-ocr-sanskrit.
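The abstract reports results as word and character error rates. As a minimal sketch of how such rates are commonly computed (Levenshtein edit distance divided by reference length), the Python below may help. The function names and the SLP1-style example strings are illustrative assumptions, not taken from the paper's dataset or code, and the authors' evaluation scripts may differ in normalisation and tokenisation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits per reference word (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example strings (illustrative only, not from the dataset).
reference = "rAmaH vanam gacCati"
ocr_output = "rAmaH vanaM gacCati"   # one wrong character in the second word
corrected = "rAmaH vanam gacCati"    # an idealised post-correction output

print(f"OCR       CER={cer(reference, ocr_output):.3f} WER={wer(reference, ocr_output):.3f}")
print(f"Corrected CER={cer(reference, corrected):.3f} WER={wer(reference, corrected):.3f}")
```

A gain stated in percentage points is then the plain difference between the two rates (for example, a WER falling from 40% to 17% is a 23 percentage point improvement), not a relative change.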
Related papers
- Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- IndicSTR12: A Dataset for Indic Scene Text Recognition [33.194567434881314]
This paper proposes the largest and most comprehensive real dataset, IndicSTR12, and benchmarks STR performance on 12 major Indian languages.
The size and complexity of the proposed dataset are comparable to those of existing Latin contemporaries.
The dataset contains over 27000 word-images gathered from various natural scenes, with over 1000 word-images for each language.
arXiv Detail & Related papers (2024-03-12T18:14:48Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Sāmayik: A Benchmark and Dataset for English-Sanskrit Translation [30.315293326789828]
Sāmayik is a dataset of around 53,000 parallel English-Sanskrit sentences, written in contemporary prose.
Sāmayik is curated from a diverse range of domains, including language instruction material, textual teaching pedagogy, and online tutorials.
arXiv Detail & Related papers (2023-05-23T12:32:24Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- An empirical study of CTC based models for OCR of Indian languages [31.5002680968116]
Modelling unsegmented sequences using Connectionist Temporal Classification (CTC) is the most commonly used approach for segmentation-free OCR.
We present a study of various neural network models that use CTC for transcribing step-wise predictions in the neural network output to a Unicode sequence.
We also introduce a new public dataset called Mozhi for word and line recognition in Indian languages.
arXiv Detail & Related papers (2022-05-13T16:19:21Z)
- Aksharantar: Open Indic-language Transliteration datasets and models for the Next Billion Users [32.23606056944172]
We introduce Aksharantar, the largest publicly available transliteration dataset for Indian languages created by mining from monolingual and parallel corpora.
The dataset contains 26 million transliteration pairs for 21 Indic languages from 3 language families using 12 scripts.
Aksharantar is 21 times larger than existing datasets and is the first publicly available dataset for 7 languages and 1 language family.
arXiv Detail & Related papers (2022-05-06T05:13:12Z)
- Comprehensive Benchmark Datasets for Amharic Scene Text Detection and Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We present the first comprehensive public datasets, named HUST-ART, HUST-AST, ABE, and Tana, for Amharic script detection and recognition in natural scenes.
arXiv Detail & Related papers (2022-03-23T03:19:35Z)
- Towards Boosting the Accuracy of Non-Latin Scene Text Recognition [27.609596088151644]
Scene-text recognition is remarkably better in Latin languages than in non-Latin languages.
This paper examines the possible reasons for low accuracy by comparing English datasets with non-Latin languages.
arXiv Detail & Related papers (2022-01-10T06:36:43Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)