SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition
- URL: http://arxiv.org/abs/2411.15858v1
- Date: Sun, 24 Nov 2024 14:21:35 GMT
- Title: SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition
- Authors: Yongkun Du, Zhineng Chen, Hongtao Xie, Caiyan Jia, Yu-Gang Jiang,
- Abstract summary: We propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed.
SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context.
We evaluate SVTRv2 in both standard and recent challenging benchmarks.
- Score: 77.28814034644287
- License:
- Abstract: Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which only contains a visual model and a CTC-aligned linear classifier, and therefore fast inference. However, they generally have worse accuracy than encoder-decoder-based methods (EDTRs), particularly in challenging scenarios. In this paper, we propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed. SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context, which endows it with the capability to deal with challenging and diverse text instances. First, a multi-size resizing (MSR) strategy is proposed to adaptively resize the text and maintain its readability. Meanwhile, we introduce a feature rearrangement module (FRM) to ensure that visual features accommodate the alignment requirement of CTC well, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module (SGM). It integrates linguistic context into the visual model, allowing it to leverage language information for improved accuracy. Moreover, SGM can be omitted at the inference stage and would not increase the inference cost. We evaluate SVTRv2 in both standard and recent challenging benchmarks, where SVTRv2 is fairly compared with 24 mainstream STR models across multiple scenarios, including different types of text irregularity, languages, and long text. The results indicate that SVTRv2 surpasses all the EDTRs across the scenarios in terms of accuracy and speed. Code is available at https://github.com/Topdu/OpenOCR.
Related papers
- Coupling Speech Encoders with Downstream Text Models [4.679869237248675]
We present a modular approach to building cascade speech translation models.
We preserve state-of-the-art speech recognition (ASR) and text translation (MT) performance for a given task.
arXiv Detail & Related papers (2024-07-24T19:29:13Z) - Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z) - Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition methods have struggled to attain high accuracy and fast inference speed.
We present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
arXiv Detail & Related papers (2023-07-23T09:04:13Z) - RASR2: The RWTH ASR Toolkit for Generic Sequence-to-sequence Speech
Recognition [43.081758770899235]
We present RASR2, a research-oriented generic S2S decoder implemented in C++.
It offers a strong flexibility/compatibility for various S2S models, language models, label units/topologies and neural network architectures.
It provides efficient decoding for both open- and closed-vocabulary scenarios based on a generalized search framework with rich support for different search modes and settings.
arXiv Detail & Related papers (2023-05-28T17:48:48Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T)
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - SVTR: Scene Text Recognition with a Single Visual Model [44.26135584093631]
We propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework.
The method, termed SVTR, firstly decomposes an image text into small patches named character components.
Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR.
arXiv Detail & Related papers (2022-04-30T04:37:01Z) - Towards Escaping from Language Bias and OCR Error: Semantics-Centered
Text Visual Question Answering [14.010472385359163]
Texts in scene images convey critical information for scene understanding and reasoning.
Current TextVQA models do not center on the text and suffer from several limitations.
We propose a novel Semantics-Centered Network (SC-Net) that consists of an instance-level contrastive semantic prediction module and a semantics-centered transformer module.
arXiv Detail & Related papers (2022-03-24T08:21:41Z) - Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs)
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z) - Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms the SoTA models on TextVQA dataset and two tasks of ST-VQA dataset among all models except pre-training based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.