SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor
- URL: http://arxiv.org/abs/2401.10110v5
- Date: Tue, 20 Aug 2024 02:34:29 GMT
- Title: SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor
- Authors: Xianfu Cheng, Weixiao Zhou, Xiang Li, Jian Yang, Hang Zhang, Tao Sun, Wei Zhang, Yuying Mai, Tongliang Li, Xiaoming Chen, Zhoujun Li
- Abstract summary: Scene Text Recognition is an important and challenging upstream task for building structured information databases.
Current state-of-the-art (SOTA) models for STR exhibit high performance, but suffer from low inference efficiency.
We propose a VIsion Permutable extractor for fast and efficient Scene Text Recognition (SVIPTR).
- Score: 32.29602765394547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene Text Recognition (STR) is an important and challenging upstream task for building structured information databases; it involves recognizing text within images of natural scenes. Although current state-of-the-art (SOTA) models for STR exhibit high performance, they typically suffer from low inference efficiency due to their reliance on hybrid architectures composed of visual encoders and sequence decoders. In this work, we propose a VIsion Permutable extractor for fast and efficient Scene Text Recognition (SVIPTR), which achieves an impressive balance between high performance and rapid inference speed in the domain of STR. Specifically, SVIPTR leverages a visual-semantic extractor with a pyramid structure, characterized by the permutation and combination of local and global self-attention layers. This design yields a lightweight and efficient model whose inference is insensitive to input length. Extensive experimental results on various standard datasets for both Chinese and English scene text recognition validate the superiority of SVIPTR. Notably, the SVIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speeds. Meanwhile, SVIPTR-L (Large) attains SOTA accuracy among single-encoder models, while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge and greatly benefits real-world applications requiring fast and efficient STR. The code is publicly available at https://github.com/cxfyxl/VIPTR.
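To make the abstract's "permutation and combination of local and global self-attention layers" concrete, here is a minimal PyTorch sketch of one pyramid stage that interleaves windowed (local) and full-sequence (global) attention. The dimensions, window length, and the "LLG" ordering are illustrative assumptions, not SVIPTR's actual configuration; see the linked repository for the real implementation.

```python
# Hypothetical sketch of permuted local/global self-attention mixing
# (inspired by the SVIPTR abstract; sizes and ordering are illustrative).
import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    """Self-attention restricted to fixed-size windows along the token axis."""
    def __init__(self, dim, heads, window):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (B, N, D), N % window == 0
        B, N, D = x.shape
        w = x.reshape(B * N // self.window, self.window, D)
        out, _ = self.attn(w, w, w)                # attend within each window only
        return out.reshape(B, N, D)

class GlobalAttention(nn.Module):
    """Vanilla self-attention over the full token sequence."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

class PermutedStage(nn.Module):
    """One pyramid stage: a chosen permutation of local (L) and global (G) layers."""
    def __init__(self, dim=192, heads=6, window=7, pattern="LLG"):
        super().__init__()
        self.layers = nn.ModuleList(
            LocalAttention(dim, heads, window) if c == "L" else GlobalAttention(dim, heads)
            for c in pattern
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in pattern)

    def forward(self, x):
        for norm, layer in zip(self.norms, self.layers):
            x = x + layer(norm(x))                 # pre-norm residual connection
        return x

tokens = torch.randn(2, 56, 192)                   # 56 tokens = 8 windows of 7
print(PermutedStage()(tokens).shape)               # torch.Size([2, 56, 192])
```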
Related papers
- FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting [14.054151352916296]
This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer-Decoder architecture.
FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for arbitrary-shaped texts.
Our results indicate that FastTextSpotter achieves superior accuracy in detecting and recognizing multilingual scene text.
arXiv Detail & Related papers (2024-08-27T12:28:41Z)
- Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition [56.968108142307976]
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models.
This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
arXiv Detail & Related papers (2024-02-24T13:00:54Z)
- Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition methods have struggled to attain high accuracy and fast inference speed simultaneously.
We present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
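As a rough illustration of why parallel decoding is faster, the toy sketch below contrasts an autoregressive loop (one forward step per character) with a single parallel pass over all positions; the decoder here is a stand-in linear head, not CPPD's actual modules.

```python
# Toy contrast between autoregressive and parallel decoding for STR.
import torch
import torch.nn as nn

vocab, max_len, dim = 100, 25, 256
feats = torch.randn(1, max_len, dim)        # visual features for one image
embed = nn.Embedding(vocab, dim)
head = nn.Linear(dim, vocab)

# Autoregressive decoding: one forward step per character, each step
# conditioned on the previously predicted token (sequential, hard to batch).
prev = torch.zeros(1, dtype=torch.long)     # [BOS] placeholder id 0
ar_out = []
for t in range(max_len):
    step = feats[:, t] + embed(prev)        # fuse visual + language context
    prev = head(step).argmax(-1)
    ar_out.append(prev)

# Parallel decoding: every position predicted in a single forward pass,
# which is where the reported ~8x speedup over AR decoders comes from.
par_out = head(feats).argmax(-1)            # (1, max_len) in one shot
```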
arXiv Detail & Related papers (2023-07-23T09:04:13Z)
- YORO -- Lightweight End to End Visual Grounding [58.17659561501071]
YORO is a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task.
It consumes natural language queries, image patches, and learnable detection tokens, and predicts the coordinates of the referred object.
YORO is shown to support real-time inference and outperform all approaches in this class (single-stage methods) by large margins.
arXiv Detail & Related papers (2022-11-15T05:34:40Z)
- SVTR: Scene Text Recognition with a Single Visual Model [44.26135584093631]
We propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework.
The method, termed SVTR, first decomposes an image text into small patches named character components.
Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR.
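A minimal sketch of the patch-wise tokenization step, assuming a strided-convolution patch embedding (the patch size and channel width here are illustrative, not SVTR's exact settings):

```python
# Sketch of tokenizing a cropped text image into "character component"
# tokens via a strided convolution (illustrative dimensions).
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 192, kernel_size=4, stride=4)  # 4x4 patches -> 192-d

img = torch.randn(1, 3, 32, 128)             # typical HxW for a text line crop
tokens = patch_embed(img)                    # (1, 192, 8, 32) feature map
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 256, 192): 256 component tokens
print(tokens.shape)
```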
arXiv Detail & Related papers (2022-04-30T04:37:01Z)
- Pushing the Performance Limit of Scene Text Recognizer without Human Annotation [17.092815629040388]
We aim to boost STR models by leveraging both synthetic data and a large number of real unlabeled images.
A character-level consistency regularization is designed to mitigate the misalignment between characters in sequence recognition.
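A toy version of such a consistency term, assuming per-position predictions from two augmented views of the same unlabeled images (the paper's character alignment is more involved than this KL sketch):

```python
# Minimal sketch of character-level consistency regularization: encourage
# per-character predictions on two augmented views to agree.
import torch
import torch.nn.functional as F

def char_consistency_loss(logits_a, logits_b):
    """logits_*: (B, T, vocab) predictions for two views of the same images."""
    log_p = F.log_softmax(logits_a, dim=-1)
    q = F.softmax(logits_b.detach(), dim=-1)   # stop-gradient on the target view
    return F.kl_div(log_p, q, reduction="batchmean")

la, lb = torch.randn(4, 25, 100), torch.randn(4, 25, 100)
print(char_consistency_loss(la, lb))
```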
arXiv Detail & Related papers (2022-04-16T04:42:02Z)
- Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding [78.71529237748018]
Grounding temporal video segments described by natural language queries, both effectively and efficiently, is a crucial capability in vision-and-language research.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
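A generic sketch of common-space alignment, assuming simple linear projections into a shared embedding and cosine scoring; the commonsense guidance itself is beyond this toy example:

```python
# Project video-moment and query-text features into a shared space,
# L2-normalize, and rank candidate moments by cosine similarity
# (projection sizes are assumptions, not the paper's configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

proj_vis = nn.Linear(512, 256)          # video-moment features -> common space
proj_txt = nn.Linear(768, 256)          # query-text features  -> common space

moments = F.normalize(proj_vis(torch.randn(10, 512)), dim=-1)  # 10 candidates
query = F.normalize(proj_txt(torch.randn(1, 768)), dim=-1)
scores = query @ moments.t()            # (1, 10) cosine similarities
best = scores.argmax(-1)                # index of the best-matching moment
```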
arXiv Detail & Related papers (2022-04-04T13:07:05Z)
- Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion [62.269219152425556]
Segmentation-based methods have drawn extensive attention in the scene text detection field.
We propose a Differentiable Binarization (DB) module that integrates the binarization process into a segmentation network.
An efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively.
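The key trick in DB is replacing the hard step B = 1 if P >= T else 0 with a steep sigmoid B̂ = 1 / (1 + exp(-k(P - T))), with amplifying factor k = 50 in the paper, so the binarization becomes differentiable:

```python
# Differentiable binarization as in the DB paper: a steep sigmoid stands in
# for the hard thresholding step so gradients can flow through it.
import torch

def differentiable_binarization(P, T, k=50.0):
    """P: probability map, T: learned threshold map, same shape."""
    return torch.sigmoid(k * (P - T))

P = torch.rand(1, 1, 64, 64)            # per-pixel text probability
T = torch.full_like(P, 0.3)             # toy threshold map (normally predicted)
B_hat = differentiable_binarization(P, T)
```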
arXiv Detail & Related papers (2022-02-21T15:30:14Z)
- AutoSTR: Efficient Backbone Search for Scene Text Recognition [80.7290173000068]
Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes.
We propose automated STR (AutoSTR) to search data-dependent backbones to boost text recognition performance.
Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks.
arXiv Detail & Related papers (2020-03-14T06:51:04Z)