Context Perception Parallel Decoder for Scene Text Recognition
- URL: http://arxiv.org/abs/2307.12270v2
- Date: Mon, 9 Oct 2023 05:48:11 GMT
- Title: Context Perception Parallel Decoder for Scene Text Recognition
- Authors: Yongkun Du and Zhineng Chen and Caiyan Jia and Xiaoting Yin and
Chenxia Li and Yuning Du and Yu-Gang Jiang
- Abstract summary: Scene text recognition methods have struggled to attain high accuracy and fast inference speed.
We present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
- Score: 52.620841341333524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene text recognition (STR) methods have struggled to attain high accuracy
and fast inference speed. Autoregressive (AR)-based models implement the
recognition in a character-by-character manner, showing superiority in accuracy
but with slow inference speed. Alternatively, parallel decoding (PD)-based
models infer all characters in a single decoding pass, offering faster
inference speed but generally worse accuracy. We first present an empirical
study of AR decoding in STR, and discover that the AR decoder not only models
linguistic context, but also provides guidance on visual context perception.
Consequently, we propose Context Perception Parallel Decoder (CPPD) to predict
the character sequence in a PD pass. CPPD devises a character counting module
to infer the occurrence count of each character, and a character ordering
module to deduce the content-free reading order and placeholders. Meanwhile,
the character prediction task associates the placeholders with characters. Together,
they build a comprehensive recognition context. We construct a series of
CPPD models and also plug the proposed modules into existing STR decoders.
Experiments on both English and Chinese benchmarks demonstrate that the CPPD
models achieve highly competitive accuracy while running approximately 8x
faster than their AR-based counterparts. Moreover, the plugged models achieve
significant accuracy improvements. Code is available at
https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/doc/doc_en/algorithm_rec_cppd_en.md
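The abstract describes three signals that are computed together in one parallel pass: per-class character counts, content-free ordering placeholders, and a character prediction over those placeholders. Below is a minimal PyTorch-style sketch of such a decoder; the module names, dimensionalities, and the way the counting and ordering contexts are combined are illustrative assumptions rather than the paper's exact design, and the official implementation is the PaddleOCR code linked above.

```python
import torch
import torch.nn as nn

class ParallelContextDecoder(nn.Module):
    """Toy CPPD-style decoder: counting + ordering contexts, one parallel pass."""
    def __init__(self, d_model=384, num_classes=37, max_len=25):
        super().__init__()
        # Content-free ordering queries: one learnable placeholder per slot.
        self.order_queries = nn.Parameter(torch.randn(max_len, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8,
                                                batch_first=True)
        # Character counting head: occurrence count of each character class.
        self.count_head = nn.Linear(d_model, num_classes)
        # Character prediction head: classifies every placeholder at once.
        self.char_head = nn.Linear(d_model, num_classes + 1)  # +1 for <eos>

    def forward(self, visual_feats):                 # (B, N, D) encoder output
        batch = visual_feats.size(0)
        # Counting context: pooled visual features -> per-class counts.
        counts = self.count_head(visual_feats.mean(dim=1))          # (B, C)
        # Ordering context: placeholders attend to the image in parallel.
        queries = self.order_queries.unsqueeze(0).expand(batch, -1, -1)
        glimpses, _ = self.cross_attn(queries, visual_feats, visual_feats)
        # Character prediction: all positions are decoded simultaneously.
        logits = self.char_head(glimpses)                           # (B, L, C+1)
        return logits, counts

decoder = ParallelContextDecoder()
feats = torch.randn(2, 64, 384)        # dummy features from a visual encoder
logits, counts = decoder(feats)
chars = logits.argmax(-1)              # entire sequence predicted in one shot
```

Because every placeholder is classified simultaneously, inference cost does not grow with text length, which is the source of the reported speedup over AR decoding.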
Related papers
- SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition [77.28814034644287]
We propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed.
SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context.
We evaluate SVTRv2 on both standard and recent challenging benchmarks.
arXiv Detail & Related papers (2024-11-24T14:21:35Z)
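Since the document's central contrast is parallel versus autoregressive decoding, a brief reminder of why CTC recognizers such as SVTRv2 infer in a single pass may help. The snippet below is generic CTC greedy decoding, not SVTRv2's actual code: every frame is classified independently, then repeated labels are collapsed and blanks removed.

```python
def ctc_greedy_decode(frame_logits, blank=0):
    """frame_logits: per-frame class scores (one list of floats per time step)."""
    # Step 1: classify every frame independently (fully parallelizable).
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    # Step 2: collapse consecutive repeats and drop the blank symbol.
    decoded, prev = [], None
    for cls in best:
        if cls != blank and cls != prev:
            decoded.append(cls)
        prev = cls
    return decoded

# Frames voting for class 2, class 2 again, blank, then class 1 -> [2, 1].
print(ctc_greedy_decode([[0.1, 0.2, 0.7], [0.1, 0.1, 0.8],
                         [0.9, 0.05, 0.05], [0.2, 0.7, 0.1]]))
```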
- General Detection-based Text Line Recognition [15.761142324480165]
We introduce a general detection-based approach to text line recognition, be it printed (OCR) or handwritten (HTR).
Our approach builds on a completely different paradigm than state-of-the-art HTR methods, which rely on autoregressive decoding.
We improve state-of-the-art performances for Chinese script recognition on the CASIA v2 dataset, and for cipher recognition on the Borg and Copiale datasets.
arXiv Detail & Related papers (2024-09-25T17:05:55Z)
- When Counting Meets HMER: Counting-Aware Network for Handwritten Mathematical Expression Recognition [57.51793420986745]
We propose an unconventional network for handwritten mathematical expression recognition (HMER) named Counting-Aware Network (CAN).
We design a weakly-supervised counting module that can predict the number of each symbol class without the symbol-level position annotations.
Experiments on the benchmark datasets for HMER validate that both joint optimization and counting results are beneficial for correcting the prediction errors of encoder-decoder models.
arXiv Detail & Related papers (2022-07-23T08:39:32Z)
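The CAN entry above hinges on counting supervision that needs no symbol-level position labels. The sketch below illustrates that weak-supervision idea in PyTorch; the pooling-based counting head and the MSE loss are illustrative choices and do not reproduce CAN's exact module.

```python
import torch
import torch.nn as nn

class CountingHead(nn.Module):
    """Predicts a per-class count vector from a convolutional feature map."""
    def __init__(self, channels=256, num_classes=100):
        super().__init__()
        self.score = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, feat_map):                       # (B, C, H, W)
        density = torch.sigmoid(self.score(feat_map))  # per-class "density" map
        return density.sum(dim=(2, 3))                 # (B, num_classes)

def count_targets(label_ids, num_classes):
    """Weak supervision: counts are derived from the label string alone."""
    target = torch.zeros(num_classes)
    for c in label_ids:
        target[c] += 1
    return target

head = CountingHead()
pred_counts = head(torch.randn(1, 256, 8, 32))
target = count_targets([3, 7, 3, 15], num_classes=100).unsqueeze(0)
loss = nn.functional.mse_loss(pred_counts, target)  # no position labels needed
```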
- Scene Text Recognition with Permuted Autoregressive Sequence Models [15.118059441365343]
Context-aware STR methods typically use internal autoregressive (AR) language models (LMs).
Our method, PARSeq, learns an ensemble of internal AR LMs with shared weights using Permutation Language Modeling.
It achieves context-free non-AR and context-aware AR inference, and iterative refinement using bidirectional context.
arXiv Detail & Related papers (2022-07-14T14:51:50Z)
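PARSeq's "ensemble of internal AR LMs with shared weights" comes from Permutation Language Modeling: each sampled factorization order yields a different attention mask for the same decoder parameters. The helper below is an illustrative reconstruction of that masking idea, not PARSeq's actual implementation.

```python
import torch

def permutation_mask(perm):
    """perm: 1-D LongTensor giving target positions in generation order.
    Returns a (T, T) boolean mask where entry [i, j] is True if position i
    may attend to position j under this factorization order."""
    T = perm.numel()
    rank = torch.empty(T, dtype=torch.long)
    rank[perm] = torch.arange(T)    # rank[pos] = step at which pos is decoded
    return rank.unsqueeze(1) > rank.unsqueeze(0)  # attend only to earlier steps

# The identity order reproduces the usual left-to-right causal mask ...
print(permutation_mask(torch.tensor([0, 1, 2, 3])))
# ... and any other order is an equally valid AR factorization that reuses
# the same decoder weights, which is what yields the implicit ensemble.
print(permutation_mask(torch.randperm(4)))
```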
- Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25 times.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)
- SVTR: Scene Text Recognition with a Single Visual Model [44.26135584093631]
We propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework.
The method, termed SVTR, first decomposes a text image into small patches named character components.
Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR.
arXiv Detail & Related papers (2022-04-30T04:37:01Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs, followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.