Transfer Learning for Scene Text Recognition in Indian Languages
- URL: http://arxiv.org/abs/2201.03180v1
- Date: Mon, 10 Jan 2022 06:14:49 GMT
- Title: Transfer Learning for Scene Text Recognition in Indian Languages
- Authors: Sanjana Gunna, Rohit Saluja and C. V. Jawahar
- Abstract summary: We investigate the power of transfer learning for all the layers of deep scene text recognition networks from English to two common Indian languages.
We show that the transfer of English models to simple synthetic datasets of Indian languages is not practical.
We set new benchmarks for scene-text recognition on Hindi, Telugu, and Malayalam datasets from IIIT-ILST and Bangla dataset from MLT-17.
- Score: 27.609596088151644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition in low-resource Indian languages is challenging
because of complexities like multiple scripts, fonts, text size, and
orientations. In this work, we investigate the power of transfer learning for
all the layers of deep scene text recognition networks from English to two
common Indian languages. We perform experiments on the conventional CRNN model
and STAR-Net to ensure generalisability. To study the effect of change in
different scripts, we initially run our experiments on synthetic word images
rendered using Unicode fonts. We show that the transfer of English models to
simple synthetic datasets of Indian languages is not practical. Instead, we
propose to apply transfer learning techniques among Indian languages due to
similarity in their n-gram distributions and visual features like the vowels
and conjunct characters. We then study the transfer learning among six Indian
languages with varying complexities in fonts and word length statistics. We
also demonstrate that the learned features of the models transferred from other
Indian languages are visually closer to (and sometimes even better than) the
individual model features than those transferred from English. We finally set
new benchmarks for scene-text recognition on Hindi, Telugu, and Malayalam
datasets from IIIT-ILST and Bangla dataset from MLT-17 by achieving 6%, 5%, 2%,
and 23% gains in Word Recognition Rates (WRRs) compared to previous works. We
further improve the MLT-17 Bangla results by plugging in a novel correction
BiLSTM into our model. We additionally release a dataset of around 440 scene
images containing 500 Gujarati and 2535 Tamil words. WRRs improve over the
baselines by 8%, 4%, 5%, and 3% on the MLT-19 Hindi and Bangla datasets and the
Gujarati and Tamil datasets, respectively.
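The abstract reports all results as Word Recognition Rates (WRRs). As a minimal sketch, assuming WRR is the fraction of word images whose predicted string exactly matches the ground-truth label, the metric can be computed as below. The function name `word_recognition_rate` and the toy word lists are hypothetical illustrations, not taken from the paper or its code.

```python
from typing import Sequence


def word_recognition_rate(predictions: Sequence[str],
                          ground_truths: Sequence[str]) -> float:
    """Return the fraction of words recognized exactly (assumed WRR definition)."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have the same length")
    if not ground_truths:
        raise ValueError("at least one word is required")
    # A word counts as correct only if the whole string matches the label.
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)


# Hypothetical toy data: four word labels and a recognizer's outputs.
labels = ["नमस्ते", "భారత", "കേരളം", "বাংলা"]
preds = ["नमस्ते", "భారత", "കേരള", "বাংলা"]  # third word is missing its last character

wrr = word_recognition_rate(preds, labels)
print(f"WRR = {wrr:.2%}")  # prints "WRR = 75.00%"
```

Under this reading, a reported gain of, say, 6% in WRR corresponds to six more words out of every hundred being recognized in full. The sketch compares raw strings; whether the labels are Unicode-normalized before comparison is not stated in the abstract.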
Related papers
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Multilingual Text Style Transfer: Datasets & Models for Indian Languages [1.116636487692753]
This paper focuses on sentiment transfer, a popular TST subtask, across a spectrum of Indian languages.
We introduce dedicated datasets of 1,000 positive and 1,000 negative style-parallel sentences for each of these eight languages.
We evaluate the performance of various benchmark models categorized into parallel, non-parallel, cross-lingual, and shared learning approaches.
arXiv Detail & Related papers (2024-05-31T14:05:27Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643]
There is a significant gap in low-resource languages, especially the Swahili Language.
Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition.
We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
arXiv Detail & Related papers (2024-05-19T03:55:02Z)
- Optical Text Recognition in Nepali and Bengali: A Transformer-based Approach [0.0]
This paper discusses text recognition for two scripts: Bengali and Nepali.
There are about 300 million Bengali speakers and 40 million Nepali speakers.
The results signify that the suggested technique corresponds with current approaches.
arXiv Detail & Related papers (2024-04-03T00:21:14Z)
- IndiText Boost: Text Augmentation for Low Resource India Languages [0.0]
We focus on implementing techniques like Easy Data Augmentation, Back Translation, Paraphrasing, Text Generation using LLMs, and Text Expansion using LLMs for text classification on different languages.
To our knowledge, no such work exists for text augmentation on Indian languages.
arXiv Detail & Related papers (2024-01-23T20:54:40Z)
- TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models [50.40191599304911]
We propose TransliCo to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script.
We show that Furina, the model obtained by fine-tuning Glot500-m with TransliCo, outperforms the original Glot500-m on various zero-shot crosslingual transfer tasks.
arXiv Detail & Related papers (2024-01-12T15:12:48Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Towards Boosting the Accuracy of Non-Latin Scene Text Recognition [27.609596088151644]
Scene-text recognition is remarkably better in Latin languages than in non-Latin languages.
This paper examines the possible reasons for low accuracy by comparing English datasets with non-Latin languages.
arXiv Detail & Related papers (2022-01-10T06:36:43Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling [0.16252563723817934]
We classify codemixed social media comments/posts in the Dravidian languages of Tamil, Kannada, and Malayalam.
A custom dataset is constructed by transliterating all the code-mixed texts into the respective Dravidian language.
We fine-tune several recent pretrained language models on the newly constructed dataset.
arXiv Detail & Related papers (2021-08-27T08:43:08Z)