Tamizhi-Net OCR: Creating A Quality Large Scale Tamil-Sinhala-English
Parallel Corpus Using Deep Learning Based Printed Character Recognition (PCR)
- URL: http://arxiv.org/abs/2109.05952v1
- Date: Mon, 13 Sep 2021 13:26:30 GMT
- Authors: Charangan Vasantharajan and Uthayasanker Thayasivam
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most low-resource languages lack the resources needed to create
even a substantial monolingual corpus. These languages can often be found in
government proceedings, but mostly as Portable Document Format (PDF) files
that contain legacy fonts. Extracting text from these documents to build a
monolingual corpus is challenging because of the legacy font usage and
printer-friendly encodings, which are not optimized for text extraction.
Therefore, we propose a simple, automatic, and novel approach that scales
across Tamil, Sinhala, and English and across large document collections. For
this purpose, we enhanced Tesseract 4.1.1 by employing LSTM-based training on
many legacy fonts to recognize printed characters in the above languages. In
particular, our model detects code-mixed text, numbers, and special characters
in printed documents. We show that this approach boosts the character-level
accuracy of Tesseract 4.1.1 from 85.5 to 98.2 for Tamil (+12.9% relative
change) and from 91.8 to 94.8 for Sinhala (+3.26% relative change) on a
dataset considered challenging by its authors.
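The headline numbers above are character-level accuracies. A minimal sketch of how such a metric can be computed, assuming the common definition 1 − CER (character error rate, i.e. Levenshtein distance normalized by reference length); the paper may use a slightly different variant:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between reference and hypothesis strings."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, start=1):
        cur = [i]
        for j, hc in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rc != hc)))    # substitution
        prev = cur
    return prev[-1]

def char_accuracy(ref: str, hyp: str) -> float:
    """Character-level accuracy: 1 - CER, floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - levenshtein(ref, hyp) / len(ref))

# Hypothetical example: OCR output drops the final virama of a Tamil word.
ref = "தமிழ்"   # 5 code points
hyp = "தமிழ"    # 4 code points, one deletion
print(round(char_accuracy(ref, hyp), 3))  # → 0.8
```

Note that for Tamil and Sinhala the result depends on whether the strings are compared as Unicode code points (as here) or as grapheme clusters, since both scripts make heavy use of combining marks.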
Related papers
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
- T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
arXiv Detail & Related papers (2024-06-11T10:06:53Z)
- Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines [1.174020933567308]
Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan.
Current Optical Character Recognition (OCR) systems are unable to extract text from historical documents as they have many issues.
In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, that has been used to extract text for various languages.
arXiv Detail & Related papers (2024-04-09T08:08:03Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining [65.30528567491984]
This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language.
The use of text-only data allows the development of TTS systems for low-resource languages.
Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
arXiv Detail & Related papers (2023-01-30T00:53:50Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
- TaTa: A Multilingual Table-to-Text Dataset for African Languages [32.348630887289524]
Table-to-Text in African languages (TaTa) is the first large multilingual table-to-text dataset with a focus on African languages.
TaTa includes 8,700 examples in nine languages, including four African languages (Hausa, Igbo, Swahili, and Yorùbá) and a zero-shot test language (Russian).
arXiv Detail & Related papers (2022-10-31T21:05:42Z)
- Towards Boosting the Accuracy of Non-Latin Scene Text Recognition [27.609596088151644]
Scene-text recognition is remarkably better in Latin languages than in non-Latin languages.
This paper examines the possible reasons for low accuracy by comparing English datasets with non-Latin languages.
arXiv Detail & Related papers (2022-01-10T06:36:43Z)
- Transfer Learning for Scene Text Recognition in Indian Languages [27.609596088151644]
We investigate the power of transfer learning for all the layers of deep scene text recognition networks from English to two common Indian languages.
We show that the transfer of English models to simple synthetic datasets of Indian languages is not practical.
We set new benchmarks for scene-text recognition on Hindi, Telugu, and Malayalam datasets from IIIT-ILST and Bangla dataset from MLT-17.
arXiv Detail & Related papers (2022-01-10T06:14:49Z)
- Large Scale Font Independent Urdu Text Recognition System [1.5229257192293197]
There exists no automated system that can reliably recognize printed Urdu text in images and videos across different fonts.
We have developed Qaida, a large scale data set with 256 fonts, and a complete Urdu lexicon.
We have also developed a Convolutional Neural Network (CNN) based classification model which can recognize Urdu ligatures with 84.2% accuracy.
arXiv Detail & Related papers (2020-05-14T06:57:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.