Data Augmentation for Scene Text Recognition
- URL: http://arxiv.org/abs/2108.06949v1
- Date: Mon, 16 Aug 2021 07:53:30 GMT
- Title: Data Augmentation for Scene Text Recognition
- Authors: Rowel Atienza
- Abstract summary: Scene text recognition (STR) is a challenging task in computer vision due to the large number of possible text appearances in natural scenes.
Most STR models rely on synthetic datasets for training since there are no sufficiently big and publicly available labelled real datasets.
In this paper, we introduce STRAug which is made of 36 image augmentation functions designed for STR.
- Score: 19.286766429954174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition (STR) is a challenging task in computer vision due to
the large number of possible text appearances in natural scenes. Most STR
models rely on synthetic datasets for training since there are no sufficiently
big and publicly available labelled real datasets. Since STR models are
evaluated using real data, the mismatch between training and testing data
distributions results in poor performance, especially on challenging text
affected by noise, artifacts, geometry, structure, etc. In this
paper, we introduce STRAug which is made of 36 image augmentation functions
designed for STR. Each function mimics certain text image properties that can
be found in natural scenes, caused by camera sensors, or induced by signal
processing operations but poorly represented in the training dataset. When
applied to strong baseline models using RandAugment, STRAug significantly
increases the overall absolute accuracy of STR models across regular and
irregular test datasets by as much as 2.10% on Rosetta, 1.48% on R2AM, 1.30% on
CRNN, 1.35% on RARE, 1.06% on TRBA and 0.89% on GCRNN. The diversity and
simplicity of the API provided by the STRAug functions enable easy replication
and validation of existing data augmentation methods for STR. STRAug is
available
at https://github.com/roatienza/straug.
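The project README illustrates this API with callable augmentation objects applied to PIL images. The sketch below follows that pattern; the module and class names match the published package, but the RandAugment-style composition at the end is an illustrative assumption rather than code from the paper, and exact signatures should be checked against the current release.

# Minimal STRAug usage sketch (assumes: pip install straug Pillow).
# Module/class names follow https://github.com/roatienza/straug; the
# rand_augment() helper below is hypothetical, for illustration only.
import random

from PIL import Image
from straug.blur import GaussianBlur, MotionBlur
from straug.geometry import Perspective, Rotate
from straug.noise import GaussianNoise
from straug.warp import Curve, Distort, Stretch

img = Image.open("word_image.png").convert("RGB")

# Each augmentation is a callable object: op(img, mag=magnitude).
curved = Curve()(img, mag=2)    # bend the text along a curve
rotated = Rotate()(img, mag=1)  # small in-plane rotation

# RandAugment-style policy: apply N randomly chosen ops per image.
ops = [Curve(), Distort(), Stretch(), Perspective(), Rotate(),
       GaussianBlur(), MotionBlur(), GaussianNoise()]

def rand_augment(img, num_ops=2, mag=1):
    for op in random.sample(ops, num_ops):
        img = op(img, mag=mag)
    return img

augmented = rand_augment(img)
augmented.save("word_image_aug.png")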
Related papers
- Getting it Right: Improving Spatial Consistency in Text-to-Image Models [103.52640413616436]
One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt.
We create SPRIGHT, the first spatially focused, large-scale dataset, by re-captioning 6 million images from 4 widely used vision datasets.
We find that training on images containing a larger number of objects leads to substantial improvements in spatial consistency, including state-of-the-art results on T2I-CompBench with a spatial score of 0.2133, by fine-tuning on 500 images.
arXiv Detail & Related papers (2024-04-01T15:55:25Z)
- IndicSTR12: A Dataset for Indic Scene Text Recognition [33.194567434881314]
This paper proposes IndicSTR12, the largest and most comprehensive real dataset for the task to date, and benchmarks STR performance on 12 major Indian languages.
The size and complexity of the proposed dataset are comparable to those of existing Latin contemporaries.
The dataset contains over 27000 word-images gathered from various natural scenes, with over 1000 word-images for each language.
arXiv Detail & Related papers (2024-03-12T18:14:48Z)
- Raising the Bar of AI-generated Image Detection with CLIP [50.345365081177555]
The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images.
We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios.
arXiv Detail & Related papers (2023-11-30T21:11:20Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and the insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Geometric Perception based Efficient Text Recognition [0.0]
In real-world applications with fixed camera positions, the underlying data tends to be regular scene text.
This paper introduces the underlying concepts, theory, implementation, and experimental results for developing such specialized models.
We introduce a novel deep learning architecture (GeoTRNet), trained to identify digits in a regular scene image, only using the geometrical features present.
arXiv Detail & Related papers (2023-02-08T04:19:24Z)
- The Surprisingly Straightforward Scene Text Removal Method With Gated Attention and Region of Interest Generation: A Comprehensive Prominent Model Analysis [0.76146285961466]
Scene text removal (STR) is a task of erasing text from natural scene images.
We introduce a simple yet extremely effective Gated Attention (GA) and Region-of-Interest Generation (RoIG) methodology in this paper.
Experimental results on the benchmark dataset show that our method significantly outperforms existing state-of-the-art methods in almost all metrics.
arXiv Detail & Related papers (2022-10-14T03:34:21Z)
- What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels [53.51264148594141]
Scene text recognition (STR) task has a common practice: All state-of-the-art STR models are trained on large synthetic data.
Training STR models on real data is nearly impossible because real data is insufficient.
We show that STR models can be trained satisfactorily using only real labeled data.
arXiv Detail & Related papers (2021-03-07T17:05:54Z)
- Learning Statistical Texture for Semantic Segmentation [53.7443670431132]
We propose a novel Statistical Texture Learning Network (STLNet) for semantic segmentation.
For the first time, STLNet analyzes the distribution of low-level information and efficiently utilizes it for the task.
Based on the proposed Quantization and Counting Operator (QCO), two modules are introduced: (1) a Texture Enhance Module (TEM), to capture texture-related information and enhance the texture details; and (2) a Pyramid Texture Feature Extraction Module (PTFEM), to effectively extract statistical texture features at multiple scales.
arXiv Detail & Related papers (2021-03-06T15:05:35Z)
- AutoSTR: Efficient Backbone Search for Scene Text Recognition [80.7290173000068]
Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes.
We propose automated STR (AutoSTR) to search data-dependent backbones to boost text recognition performance.
Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks.
arXiv Detail & Related papers (2020-03-14T06:51:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.