SVTR: Scene Text Recognition with a Single Visual Model
- URL: http://arxiv.org/abs/2205.00159v1
- Date: Sat, 30 Apr 2022 04:37:01 GMT
- Title: SVTR: Scene Text Recognition with a Single Visual Model
- Authors: Yongkun Du and Zhineng Chen and Caiyan Jia and Xiaoting Yin and
Tianlun Zheng and Chenxia Li and Yuning Du and Yu-Gang Jiang
- Abstract summary: We propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework.
The method, termed SVTR, firstly decomposes an image text into small patches named character components.
Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR.
- Score: 44.26135584093631
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dominant scene text recognition models commonly contain two building blocks,
a visual model for feature extraction and a sequence model for text
transcription. This hybrid architecture, although accurate, is complex and less
efficient. In this study, we propose a Single Visual model for Scene Text
recognition within the patch-wise image tokenization framework, which dispenses
with the sequential modeling entirely. The method, termed SVTR, firstly
decomposes an image text into small patches named character components.
Afterward, hierarchical stages are recurrently carried out by component-level
mixing, merging and/or combining. Global and local mixing blocks are devised to
perceive the inter-character and intra-character patterns, leading to a
multi-grained character component perception. Thus, characters are recognized
by a simple linear prediction. Experimental results on both English and Chinese
scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L
(Large) achieves highly competitive accuracy in English and outperforms
existing methods by a large margin in Chinese, while running faster. In
addition, SVTR-T (Tiny) is an effective and much smaller model, which shows
appealing speed at inference. The code is publicly available at
https://github.com/PaddlePaddle/PaddleOCR.
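To make the pipeline concrete, below is a minimal PyTorch sketch of a single-visual-model recognizer in the spirit of SVTR. It is an illustrative sketch, not the PaddleOCR implementation: the class names (MixingBlock, TinySVTRLikeRecognizer), the 4x4 patch size, the use of plain self-attention for both global and local mixing, the height pooling, and all hyper-parameters are simplifying assumptions. It only shows the overall flow described in the abstract: patch-wise tokenization into character components, component-level mixing, and per-position linear prediction.

```python
import torch
import torch.nn as nn


class MixingBlock(nn.Module):
    # Component-level mixing: multi-head self-attention followed by an MLP.
    # Plain self-attention stands in for both the global and local mixing
    # blocks; the paper restricts local mixing to a neighborhood window.
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, x):                       # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))


class TinySVTRLikeRecognizer(nn.Module):
    # Single visual model: patch-wise tokenization -> stacked mixing blocks ->
    # pool the height axis -> per-position linear prediction (CTC-style head).
    def __init__(self, num_classes, in_ch=3, dim=64, depth=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=4, stride=4)
        self.blocks = nn.Sequential(*[MixingBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)  # num_classes includes the CTC blank

    def forward(self, img):                      # img: (B, 3, 32, 128) text crop
        x = self.patch_embed(img)                # (B, dim, 8, 32) "character components"
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # (B, 8*32, dim)
        tokens = self.blocks(tokens)             # inter-/intra-character mixing
        tokens = tokens.reshape(b, h, w, c).mean(dim=1)  # collapse height -> (B, 32, dim)
        return self.head(tokens)                 # per-position logits


if __name__ == "__main__":
    model = TinySVTRLikeRecognizer(num_classes=37)   # e.g. 36 symbols + blank
    logits = model(torch.randn(2, 3, 32, 128))       # (2, 32, 37)
    # Greedy CTC-style decode: argmax per position, merge repeats, drop blank (id 0).
    ids = logits.argmax(-1)[0].tolist()
    text = [c for i, c in enumerate(ids) if c != 0 and (i == 0 or c != ids[i - 1])]
    print(logits.shape, text)
```

In practice a head like this is trained with a CTC loss (e.g. torch.nn.CTCLoss) so that per-position predictions can be aligned with unsegmented label strings; the merging of components between hierarchical stages that the abstract mentions is omitted here for brevity.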
Related papers
- SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition [77.28814034644287]
We propose SVTRv2, a CTC model that beats leading encoder-decoder text recognizers (EDTRs) in both accuracy and inference speed.
SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context.
We evaluate SVTRv2 in both standard and recent challenging benchmarks.
arXiv Detail & Related papers (2024-11-24T14:21:35Z)
- General Detection-based Text Line Recognition [15.761142324480165]
We introduce a general detection-based approach to text line recognition, be it printed (OCR) or handwritten (HTR).
Our approach builds on a completely different paradigm than state-of-the-art HTR methods, which rely on autoregressive decoding.
We improve state-of-the-art performances for Chinese script recognition on the CASIA v2 dataset, and for cipher recognition on the Borg and Copiale datasets.
arXiv Detail & Related papers (2024-09-25T17:05:55Z)
- Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning [61.34060587461462]
We propose a two-stage framework for Chinese Text Recognition (CTR).
We pre-train a CLIP-like model by aligning printed character images and Ideographic Description Sequences (IDS).
This pre-training stage simulates humans recognizing Chinese characters and obtains the canonical representation of each character.
The learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition.
arXiv Detail & Related papers (2023-09-03T05:33:16Z)
- BERT4CTR: An Efficient Framework to Combine Pre-trained Language Model with Non-textual Features for CTR Prediction [12.850529317775198]
We propose a novel framework BERT4CTR, with the Uni-Attention mechanism that can benefit from the interactions between non-textual and textual features.
BERT4CTR significantly outperforms state-of-the-art frameworks in handling multi-modal inputs and is applicable to Click-Through-Rate (CTR) prediction.
arXiv Detail & Related papers (2023-08-17T08:25:54Z)
- Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition methods have struggled to attain both high accuracy and fast inference speed.
We present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
arXiv Detail & Related papers (2023-07-23T09:04:13Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Primitive Representation Learning for Scene Text Recognition [7.818765015637802]
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding.
We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
arXiv Detail & Related papers (2021-05-10T11:54:49Z)