Towards Boosting the Accuracy of Non-Latin Scene Text Recognition
- URL: http://arxiv.org/abs/2201.03185v1
- Date: Mon, 10 Jan 2022 06:36:43 GMT
- Title: Towards Boosting the Accuracy of Non-Latin Scene Text Recognition
- Authors: Sanjana Gunna, Rohit Saluja and C. V. Jawahar
- Abstract summary: Scene-text recognition is remarkably better in Latin languages than in non-Latin languages.
This paper examines the possible reasons for low accuracy by comparing English datasets with non-Latin languages.
- Score: 27.609596088151644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene-text recognition is remarkably better in Latin languages than
in non-Latin languages due to several factors such as multiple fonts, simplistic
vocabulary statistics, updated data generation tools, and writing systems. This
paper examines the possible reasons for low accuracy by comparing English
datasets with non-Latin languages. We compare various features like the size
(width and height) of the word images and word length statistics. Over the last
decade, generating synthetic datasets with powerful deep learning techniques
has tremendously improved scene-text recognition. Several controlled
experiments are performed on English, by varying the number of (i) fonts to
create the synthetic data and (ii) created word images. We discover that these
factors are critical for the scene-text recognition systems. The English
synthetic datasets utilize over 1400 fonts while Arabic and other non-Latin
datasets utilize fewer than 100 fonts for data generation. Since some of these
languages are a part of different regions, we garner additional fonts through a
region-based search to improve the scene-text recognition models in Arabic and
Devanagari. We improve the Word Recognition Rates (WRRs) on Arabic MLT-17 and
MLT-19 datasets by 24.54% and 2.32% compared to previous works or baselines. We
achieve WRR gains of 7.88% and 3.72% for IIIT-ILST and MLT-19 Devanagari
datasets.
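
The abstract reports its results as Word Recognition Rates (WRRs) without spelling out the metric. A minimal sketch follows, assuming the common definition of WRR as the fraction of word images whose predicted string exactly matches the ground-truth label. The function name `word_recognition_rate` and the sample words are hypothetical and are not taken from the paper.

```python
def word_recognition_rate(predictions, ground_truths):
    """Return the fraction of words recognized exactly (assumed WRR definition)."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have the same length")
    if not ground_truths:
        # No words to score; return 0.0 rather than dividing by zero.
        return 0.0
    # A word counts as correct only if the whole string matches.
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)


# Hypothetical example: two of the four predictions match their labels.
labels = ["नमस्ते", "भारत", "مرحبا", "سلام"]
preds = ["नमस्ते", "भारन", "مرحبا", "سلم"]
wrr = word_recognition_rate(preds, labels)
```

Under this definition, a single wrong character makes the whole word incorrect, so WRR is stricter than a character-level rate.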
Related papers
- KhmerST: A Low-Resource Khmer Scene Text Detection and Recognition Benchmark [1.5409800688911346]
We introduce the first Khmer scene-text dataset, featuring 1,544 expert-annotated images.
This diverse dataset includes flat text, raised text, poorly illuminated text, distant polygon and partially obscured text.
arXiv Detail & Related papers (2024-10-23T21:04:24Z)
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Compared to counterparts, our dataset is 15 times larger in scale while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643]
There is a significant gap in low-resource languages, especially the Swahili Language.
Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition.
We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
arXiv Detail & Related papers (2024-05-19T03:55:02Z)
- IndicSTR12: A Dataset for Indic Scene Text Recognition [33.194567434881314]
This paper proposes the largest and most comprehensive real dataset - IndicSTR12 - and benchmarks STR performance on 12 major Indian languages.
The size and complexity of the proposed dataset are comparable to those of existing Latin contemporaries.
The dataset contains over 27000 word-images gathered from various natural scenes, with over 1000 word-images for each language.
arXiv Detail & Related papers (2024-03-12T18:14:48Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- A Benchmark and Dataset for Post-OCR text correction in Sanskrit [23.45279030301887]
Sanskrit is a classical language with about 30 million extant manuscripts fit for digitisation.
We release a post-OCR text correction dataset containing around 218,000 sentences, with 1.5 million words, from 30 different books.
arXiv Detail & Related papers (2022-11-15T08:32:18Z)
- Comprehensive Benchmark Datasets for Amharic Scene Text Detection and Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We present the first comprehensive public datasets, named HUST-ART, HUST-AST, ABE, and Tana, for Amharic script detection and recognition in natural scenes.
arXiv Detail & Related papers (2022-03-23T03:19:35Z)
- Transfer Learning for Scene Text Recognition in Indian Languages [27.609596088151644]
We investigate the power of transfer learning for all the layers of deep scene text recognition networks from English to two common Indian languages.
We show that the transfer of English models to simple synthetic datasets of Indian languages is not practical.
We set new benchmarks for scene-text recognition on Hindi, Telugu, and Malayalam datasets from IIIT-ILST and Bangla dataset from MLT-17.
arXiv Detail & Related papers (2022-01-10T06:14:49Z)
- Tamizhi-Net OCR: Creating A Quality Large Scale Tamil-Sinhala-English Parallel Corpus Using Deep Learning Based Printed Character Recognition (PCR) [2.0305676256390934]
Most of the low resource languages do not have the necessary resources to create a substantial monolingual corpus.
These languages may often be found in government proceedings, but mostly in the form of Portable Document Format (PDF) files that contain legacy fonts.
Extracting text from these documents to create a monolingual corpus is challenging due to legacy font usage and printer-friendly encoding.
arXiv Detail & Related papers (2021-09-13T13:26:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.