IndicSTR12: A Dataset for Indic Scene Text Recognition
- URL: http://arxiv.org/abs/2403.08007v1
- Date: Tue, 12 Mar 2024 18:14:48 GMT
- Title: IndicSTR12: A Dataset for Indic Scene Text Recognition
- Authors: Harsh Lunia, Ajoy Mondal and C V Jawahar
- Abstract summary: This paper proposes IndicSTR12, the largest and most comprehensive real dataset for Indic scene text recognition, and benchmarks STR performance on 12 major Indian languages.
The size and complexity of the proposed dataset are comparable to those of existing Latin contemporaries.
The dataset contains over 27000 word-images gathered from various natural scenes, with over 1000 word-images for each language.
- Score: 33.194567434881314
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The importance of Scene Text Recognition (STR) in today's increasingly
digital world cannot be overstated. Given the significance of STR, data
intensive deep learning approaches that auto-learn feature mappings have
primarily driven the development of STR solutions. Several benchmark datasets
and substantial work on deep learning models are available for Latin languages
to meet this need. Far less work and far fewer datasets are available for Indian
languages, which are syntactically and semantically more complex and are spoken
and read by 1.3 billion people. This paper aims to address the Indian space's lack of a
comprehensive dataset by proposing the largest and most comprehensive real
dataset - IndicSTR12 - and benchmarking STR performance on 12 major Indian
languages. A few works have addressed the same issue, but to the best of our
knowledge, they focused on a small number of Indian languages. The size and
complexity of the proposed dataset are comparable to those of existing Latin
contemporaries, while its multilingualism will catalyse the development of
robust text detection and recognition models. It was created specifically for a
group of related languages with different scripts. The dataset contains over
27000 word-images gathered from various natural scenes, with over 1000
word-images for each language. Unlike previous datasets, the images cover a
broader range of realistic conditions, including blur, illumination changes,
occlusion, non-iconic texts, low resolution, perspective text, etc. Along with
the new dataset, we provide a high-performing baseline on three models -
PARSeq, CRNN, and STARNet.
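For context on what such a baseline entails, below is a minimal CRNN-style recognition sketch (convolutional features, a BiLSTM, and CTC loss) in PyTorch; the image size, charset size, and layer widths are illustrative assumptions, not the configuration used for the IndicSTR12 benchmarks.

```python
# Minimal CRNN-style recognizer sketch: CNN features -> BiLSTM -> CTC.
# Hyperparameters (image size, charset, hidden sizes) are illustrative
# assumptions, not the settings used in the IndicSTR12 benchmarks.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),  # halve height only, keep width steps
        )
        feat_h = img_height // 8           # 32 -> 4 after the three poolings
        self.rnn = nn.LSTM(256 * feat_h, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)  # num_classes includes the CTC blank

    def forward(self, x):                  # x: (B, 1, H, W) grayscale word-image
        f = self.cnn(x)                    # (B, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one feature vector per horizontal step
        out, _ = self.rnn(f)
        return self.fc(out)                # (B, W', num_classes)

# Toy forward/backward pass with CTC loss on random word-images.
num_classes = 100                          # e.g. one Indic script's charset + blank (index 0)
model = CRNN(num_classes)
images = torch.randn(4, 1, 32, 128)       # batch of 4 grayscale word-images
logits = model(images).log_softmax(-1).permute(1, 0, 2)  # CTC expects (T, B, C)
targets = torch.randint(1, num_classes, (4, 10))          # dummy label sequences
input_lens = torch.full((4,), logits.size(0), dtype=torch.long)
target_lens = torch.full((4,), 10, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(logits, targets, input_lens, target_lens)
loss.backward()
print(float(loss))
```

PARSeq and STARNet use different architectures, so this sketch only mirrors the CRNN-style CTC baseline.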
Related papers
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643]
There is a significant gap for low-resource languages, especially Swahili.
Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition.
We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
arXiv Detail & Related papers (2024-05-19T03:55:02Z)
- TEXTRON: Weakly Supervised Multilingual Text Detection through Data Programming [21.88026116276415]
Text detection is a challenging problem in the field of computer vision (CV).
There is a scarcity of word-level labeled data for text detection, especially for multilingual settings and Indian scripts.
We propose TEXTRON, a Data Programming-based approach, where users can plug various text detection methods into a weak supervision-based learning framework.
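As a rough illustration of the data-programming idea, the sketch below treats several off-the-shelf detectors as weak labeling functions over pixels and merges their votes into a single pseudo-label mask; the function name and vote threshold are assumptions for illustration, not TEXTRON's actual interface.

```python
# Illustrative sketch of weak supervision for text detection: treat several
# detectors as labeling functions over pixels and merge their votes into one
# pseudo-label mask. Function names and the vote threshold are assumptions,
# not TEXTRON's actual interface.
import numpy as np

def aggregate_text_masks(masks, min_votes=2):
    """masks: list of binary (H, W) arrays, one per weak detector.
    Returns a pseudo-label mask where enough detectors agree."""
    votes = np.sum(np.stack(masks, axis=0), axis=0)
    return (votes >= min_votes).astype(np.uint8)

# Toy example with three "detectors" on a 4x6 image.
h, w = 4, 6
rng = np.random.default_rng(0)
weak_masks = [(rng.random((h, w)) > 0.5).astype(np.uint8) for _ in range(3)]
pseudo_labels = aggregate_text_masks(weak_masks, min_votes=2)
print(pseudo_labels)
```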
arXiv Detail & Related papers (2024-02-15T09:18:18Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages [40.01333053375582]
We aim to create a text classification dataset encompassing a large number of languages.
We leverage parallel translations of the Bible to construct such a dataset.
By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1500 languages.
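The projection step can be sketched roughly as follows: labels assigned to English verses are copied to the same verse IDs in every other language of the parallel corpus. The verse IDs, labels, and data structures below are toy examples, not the paper's actual pipeline.

```python
# Rough sketch of label projection across parallel Bible verses: annotate the
# English side once, then copy each verse's label to the same verse ID in every
# other language. Verse IDs, labels, and texts below are toy examples.
english_labels = {
    "GEN_1_1": "creation",
    "PSA_23_1": "comfort",
}

parallel_corpus = {
    "hin": {"GEN_1_1": "<Hindi verse text>", "PSA_23_1": "<Hindi verse text>"},
    "swa": {"GEN_1_1": "<Swahili verse text>", "PSA_23_1": "<Swahili verse text>"},
}

def project_labels(english_labels, parallel_corpus):
    """Build per-language (text, label) pairs by aligning on verse IDs."""
    datasets = {}
    for lang, verses in parallel_corpus.items():
        datasets[lang] = [
            (text, english_labels[verse_id])
            for verse_id, text in verses.items()
            if verse_id in english_labels
        ]
    return datasets

print(project_labels(english_labels, parallel_corpus))
```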
arXiv Detail & Related papers (2023-05-15T09:43:32Z)
- MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing [48.216386761482525]
We present MultiSpider, the largest multilingual text-to-SQL dataset, which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese).
Experimental results under three typical settings (zero-shot, monolingual and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages.
We also propose a simple augmentation framework, SAVe (Augmentation-with-Verification), which boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.
arXiv Detail & Related papers (2022-12-27T13:58:30Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning [19.203716881791312]
We introduce the Wikipedia-based Image Text (WIT) dataset.
WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages.
By the number of image-text examples, WIT is 3x larger than the next largest multimodal dataset.
arXiv Detail & Related papers (2021-03-02T18:13:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.