CSTR: A Classification Perspective on Scene Text Recognition
- URL: http://arxiv.org/abs/2102.10884v1
- Date: Mon, 22 Feb 2021 10:32:07 GMT
- Title: CSTR: A Classification Perspective on Scene Text Recognition
- Authors: Hongxiang Cai, Jun Sun, Yichao Xiong
- Abstract summary: We propose a new perspective on scene text recognition, in which we model scene text recognition as an image classification problem.
Based on this image classification perspective, we propose a scene text recognition model named CSTR.
CSTR achieves nearly state-of-the-art performance on six public benchmarks covering both regular and irregular text.
- Score: 3.286661798699067
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The prevalent perspectives on scene text recognition are sequence to
sequence (seq2seq) and segmentation. In this paper, we propose a new
perspective on scene text recognition, in which we model scene text
recognition as an image classification problem. Based on this image
classification perspective, we propose a scene text recognition model named
CSTR.
The CSTR model consists of a series of convolutional layers and a global
average pooling layer at the end, followed by independent multi-class
classification heads, each of which predicts the corresponding character of the
word sequence in the input image. The CSTR model is easy to train using parallel
cross-entropy losses.
CSTR is as simple as image classification models like ResNet (He et al., 2016),
which makes it easy to implement, and the fully convolutional
neural network architecture makes it efficient to train and deploy. We
demonstrate the effectiveness of the classification perspective on scene text
recognition with thorough experiments. Furthermore, CSTR achieves nearly
state-of-the-art performance on six public benchmarks covering both regular and
irregular text. The code will be available at
https://github.com/Media-Smart/vedastr.
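The classification view described in the abstract (a pooled feature vector fed to one independent multi-class head per character position, trained with parallel cross-entropy losses) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the charset, the extra padding class, the number of heads, the feature size, the averaging of the per-head losses, and every function name below are assumptions made for illustration only, and the convolutional backbone is replaced by a random feature map.

```python
import math
import random

# All constants below are hypothetical choices, not taken from the paper.
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"
PAD = len(CHARSET)              # assumed extra class: "no character at this position"
NUM_CLASSES = len(CHARSET) + 1
MAX_LEN = 5                     # assumed number of classification heads
FEAT_DIM = 8                    # assumed number of backbone output channels


def make_head(rng):
    """One linear head: a NUM_CLASSES x FEAT_DIM weight matrix (bias omitted)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(FEAT_DIM)]
            for _ in range(NUM_CLASSES)]


def global_average_pool(feature_map):
    """Average each channel (an H x W grid) down to a single number."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


def head_logits(head, feat):
    """Class scores of one head for the pooled feature vector."""
    return [sum(w * x for w, x in zip(row, feat)) for row in head]


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def encode(word):
    """Map a word to MAX_LEN class indices, padding short words with PAD.

    Characters outside CHARSET raise ValueError in this sketch.
    """
    ids = [CHARSET.index(c) for c in word[:MAX_LEN]]
    return ids + [PAD] * (MAX_LEN - len(ids))


def parallel_loss(heads, feat, word):
    """One cross-entropy term per head; averaging them is an assumption."""
    targets = encode(word)
    losses = [cross_entropy(head_logits(h, feat), t)
              for h, t in zip(heads, targets)]
    return sum(losses) / len(losses)


def decode(heads, feat):
    """Take the argmax of every head and stop at the first PAD prediction."""
    chars = []
    for head in heads:
        logits = head_logits(head, feat)
        idx = max(range(NUM_CLASSES), key=logits.__getitem__)
        if idx == PAD:
            break
        chars.append(CHARSET[idx])
    return "".join(chars)


# Usage: a random 2 x 4 feature map stands in for the convolutional backbone.
rng = random.Random(0)
heads = [make_head(rng) for _ in range(MAX_LEN)]
feature_map = [[[rng.random() for _ in range(4)] for _ in range(2)]
               for _ in range(FEAT_DIM)]
feat = global_average_pool(feature_map)
loss = parallel_loss(heads, feat, "cat")
prediction = decode(heads, feat)  # weights are untrained, so this is arbitrary
```

Because every head reads the same pooled vector and has its own loss term, all character positions can be predicted and trained in parallel, without the step-by-step decoding a seq2seq recognizer needs.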
Related papers
- Spatial Action Unit Cues for Interpretable Deep Facial Expression Recognition [55.97779732051921]
State-of-the-art classifiers for facial expression recognition (FER) lack interpretability, an important feature for end-users.
A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, allowing deep interpretable models to be trained.
Our new strategy is generic, and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time.
arXiv Detail & Related papers (2024-10-01T10:42:55Z) - Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition [56.968108142307976]
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models.
This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
arXiv Detail & Related papers (2024-02-24T13:00:54Z) - SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z) - Text2Model: Text-based Model Induction for Zero-shot Image Classification [38.704831945753284]
We address the challenge of building task-agnostic classifiers using only text descriptions.
We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model.
We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
arXiv Detail & Related papers (2022-10-27T05:19:55Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z) - SCATTER: Selective Context Attentional Scene Text Recognizer [16.311256552979835]
Scene Text Recognition (STR) is the task of recognizing text against complex image backgrounds.
Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes.
We introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER).
arXiv Detail & Related papers (2020-03-25T09:20:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.