TextAdaIN: Fine-Grained AdaIN for Robust Text Recognition
- URL: http://arxiv.org/abs/2105.03906v1
- Date: Sun, 9 May 2021 10:47:48 GMT
- Title: TextAdaIN: Fine-Grained AdaIN for Robust Text Recognition
- Authors: Oren Nuriel, Sharon Fogel, Ron Litman
- Abstract summary: In text recognition, we reveal that it is rather the local image statistics on which the networks overly depend.
We suggest an approach to regulate the reliance on local statistics that improves overall text recognition performance.
Our method, termed TextAdaIN, creates local distortions in the feature map which prevent the network from overfitting to the local statistics.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leveraging the characteristics of convolutional layers, image classifiers are
extremely effective. However, recent works have exposed that in many cases they
rely excessively on global image statistics that are easy to manipulate while
preserving image semantics. In text recognition, we reveal that it is rather
the local image statistics on which the networks overly depend. Motivated by
this, we suggest an approach to regulate the reliance on local statistics that
improves overall text recognition performance.
Our method, termed TextAdaIN, creates local distortions in the feature map
which prevent the network from overfitting to the local statistics. It does so
by deliberately mismatching fine-grained feature statistics between samples in
a mini-batch. Despite TextAdaIN's simplicity, extensive experiments show its
effectiveness compared to other, more complicated methods. TextAdaIN achieves
state-of-the-art results on standard handwritten text recognition benchmarks.
Additionally, it generalizes to multiple architectures and to the domain of
scene text recognition. Furthermore, we demonstrate that integrating TextAdaIN
improves robustness towards image corruptions.
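The abstract does not spell out implementation details, but the mechanism it describes is Adaptive Instance Normalization (AdaIN) applied at a fine-grained, per-window level: each window of a sample's feature map is normalized and then re-styled with the per-channel mean and standard deviation of the corresponding window from another sample in the mini-batch. The following PyTorch sketch illustrates that idea under stated assumptions; the module name, the choice to split windows along the width axis, and the hyperparameters `num_windows` and `p` are illustrative guesses, not the authors' reference implementation.

```python
import torch
import torch.nn as nn


class TextAdaINSketch(nn.Module):
    """Fine-grained AdaIN over width-wise windows of a feature map.

    Hypothetical re-implementation based only on the abstract: the window
    count `num_windows` and apply-probability `p` are assumed hyperparameters.
    """

    def __init__(self, num_windows: int = 5, p: float = 0.1, eps: float = 1e-5):
        super().__init__()
        self.num_windows = num_windows
        self.p = p
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) feature map; the distortion is applied only in training.
        if not self.training or torch.rand(()).item() > self.p:
            return x
        n, c, h, w = x.shape
        k = self.num_windows
        w_trim = (w // k) * k  # drop a remainder so W splits evenly into k windows
        head, tail = x[..., :w_trim], x[..., w_trim:]
        # (N, C, H, k, W/k) -> (N, C, k, H, W/k): one window per slot on dim 2
        windows = head.reshape(n, c, h, k, w_trim // k).permute(0, 1, 3, 2, 4)
        mu = windows.mean(dim=(3, 4), keepdim=True)
        sigma = windows.std(dim=(3, 4), keepdim=True) + self.eps
        # AdaIN with deliberately mismatched statistics: normalize each window,
        # then re-style it with the stats of the same window in another sample.
        perm = torch.randperm(n, device=x.device)
        out = (windows - mu) / sigma * sigma[perm] + mu[perm]
        out = out.permute(0, 1, 3, 2, 4).reshape(n, c, h, w_trim)
        return torch.cat([out, tail], dim=-1)
```

A module like this would typically be inserted between convolutional blocks of the recognizer's backbone; since the permutation only mismatches statistics during training, calling `model.eval()` disables the distortion at inference time.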
Related papers
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- Scene Text Recognition with Image-Text Matching-guided Dictionary [17.073688809336456]
We propose a new dictionary language model leveraging the Scene Image-Text Matching(SITM) network.
Inspired by ITC, the SITM network combines the visual features and the text features of all candidates to identify the candidate with the minimum distance in the feature space.
Our lexicon method achieves better results (93.8% accuracy) than the ordinary method (92.1% accuracy) on six mainstream benchmarks.
arXiv Detail & Related papers (2023-05-08T07:47:49Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- The Surprisingly Straightforward Scene Text Removal Method With Gated Attention and Region of Interest Generation: A Comprehensive Prominent Model Analysis [0.76146285961466]
Scene text removal (STR) is a task of erasing text from natural scene images.
We introduce a simple yet extremely effective Gated Attention (GA) and Region-of-Interest Generation (RoIG) methodology in this paper.
Experimental results on the benchmark dataset show that our method significantly outperforms existing state-of-the-art methods in almost all metrics.
arXiv Detail & Related papers (2022-10-14T03:34:21Z)
- Knowledge Mining with Scene Text for Fine-Grained Recognition [53.74297368412834]
We propose an end-to-end trainable network that mines implicit contextual knowledge behind scene text image.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification.
Our method outperforms the state-of-the-art on two benchmark datasets by 3.72% mAP and 5.39% mAP, respectively.
arXiv Detail & Related papers (2022-03-27T05:54:00Z)
- Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and search all text instances from an image gallery, which are the same or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be simply performed by ranking the detected text instances with the learned similarity.
arXiv Detail & Related papers (2021-04-04T07:18:38Z)
- AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting [98.08853679310603]
This work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter).
AE TextSpotter learns both visual and linguistic features to significantly reduce ambiguity in text detection.
To our knowledge, this is the first work to improve text detection by using a language model.
arXiv Detail & Related papers (2020-08-03T08:40:01Z)
- Text Recognition -- Real World Data and Where to Find Them [36.10220484561196]
We present a method for exploiting weakly annotated images to improve text extraction pipelines.
The approach uses an arbitrary end-to-end text recognition system to obtain text region proposals and their, possibly erroneous, transcriptions.
It produces nearly error-free, localised instances of scene text, which we treat as "pseudo ground truth" (PGT).
arXiv Detail & Related papers (2020-07-06T22:23:27Z)