VIPTR: A Vision Permutable Extractor for Fast and Efficient Scene Text
Recognition
- URL: http://arxiv.org/abs/2401.10110v3
- Date: Wed, 24 Jan 2024 03:05:53 GMT
- Title: VIPTR: A Vision Permutable Extractor for Fast and Efficient Scene Text
Recognition
- Authors: Xianfu Cheng, Weixiao Zhou, Xiang Li, Xiaoming Chen, Jian Yang,
Tongliang Li, Zhoujun Li
- Abstract summary: Scene Text Recognition (STR) is a challenging task that involves recognizing text within images of natural scenes.
We propose the VIsion Permutable extractor for fast and efficient scene Text Recognition (VIPTR).
VIPTR achieves an impressive balance between high performance and rapid inference speeds in the domain of STR.
- Score: 32.12388950990217
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene Text Recognition (STR) is a challenging task that involves recognizing
text within images of natural scenes. Although current state-of-the-art models
for STR exhibit high performance, they typically suffer from low inference
efficiency due to their reliance on hybrid architectures composed of visual
encoders and sequence decoders. In this work, we propose the VIsion Permutable
extractor for fast and efficient scene Text Recognition (VIPTR), which achieves
an impressive balance between high performance and rapid inference speeds in
the domain of STR. Specifically, VIPTR leverages a visual-semantic extractor
with a pyramid structure, characterized by multiple self-attention layers,
while eschewing the traditional sequence decoder. This design choice results in
a lightweight and efficient model capable of handling inputs of varying sizes.
Extensive experimental results on various standard datasets for both Chinese
and English scene text recognition validate the superiority of VIPTR. Notably,
the VIPTR-T (Tiny) variant delivers highly competitive accuracy on par with
other lightweight models and achieves SOTA inference speeds. Meanwhile, the
VIPTR-L (Large) variant attains greater recognition accuracy, while maintaining
a low parameter count and favorable inference speed. Our proposed method
provides a compelling solution for the STR challenge, which blends high
accuracy with efficiency and greatly benefits real-world applications requiring
fast and reliable text recognition. The code is publicly available at
https://github.com/cxfyxl/VIPTR.
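The abstract describes an encoder-only design that drops the sequence decoder. It does not name the output decoding scheme, but decoder-free recognizers in this line of work (e.g., the SVTR family listed below) commonly emit per-frame character scores that are collapsed into a string by CTC-style greedy decoding. The sketch below illustrates only that collapsing step; the blank index and charset are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: CTC-style greedy decoding, a common way decoder-free STR
# models turn per-frame class scores into text without a sequence decoder.
# The abstract does not specify VIPTR's exact scheme; BLANK and CHARSET
# below are illustrative assumptions.

BLANK = 0                                  # index reserved for the CTC blank
CHARSET = "_abcdefghijklmnopqrstuvwxyz"    # index 0 is the blank placeholder

def ctc_greedy_decode(frame_logits):
    """frame_logits: list of per-frame score lists over CHARSET."""
    # 1) pick the highest-scoring class per frame
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    out, prev = [], None
    # 2) collapse consecutive repeats, 3) drop blanks
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(CHARSET[idx])
        prev = idx
    return "".join(out)
```

Note that a blank frame between two identical characters keeps them distinct (e.g., frames `c, blank, c` decode to "cc"), which is the standard CTC collapsing rule.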
Related papers
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts seriously degrade the recognition accuracy of traditional OCR models.
We propose DLoRA-TrOCR, a parameter-efficient mixed-text recognition method based on a pre-trained OCR Transformer.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
- Instruction-Guided Scene Text Recognition [51.853730414264625]
We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem.
We develop a lightweight instruction encoder, a cross-modal feature fusion module, and a multi-task answer head, which together guide nuanced text-image understanding.
IGTR outperforms existing models by significant margins, while maintaining a small model size and efficient inference speed.
arXiv Detail & Related papers (2024-01-31T14:13:01Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition methods have struggled to attain high accuracy and fast inference speed.
We present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
arXiv Detail & Related papers (2023-07-23T09:04:13Z)
- Geometric Perception based Efficient Text Recognition [0.0]
In real-world applications with fixed camera positions, the underlying data tends to be regular scene text.
This paper introduces the underlying concepts, theory, implementation, and experiment results to develop specialized models.
We introduce a novel deep learning architecture (GeoTRNet), trained to identify digits in a regular scene image, only using the geometrical features present.
arXiv Detail & Related papers (2023-02-08T04:19:24Z)
- YORO -- Lightweight End to End Visual Grounding [58.17659561501071]
YORO is a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task.
It consumes natural language queries, image patches, and learnable detection tokens and predicts coordinates of the referred object.
YORO is shown to support real-time inference and outperform all approaches in this class (single-stage methods) by large margins.
arXiv Detail & Related papers (2022-11-15T05:34:40Z)
- SVTR: Scene Text Recognition with a Single Visual Model [44.26135584093631]
We propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework.
The method, termed SVTR, first decomposes a text image into small patches termed character components.
Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR.
arXiv Detail & Related papers (2022-04-30T04:37:01Z)
- Pushing the Performance Limit of Scene Text Recognizer without Human Annotation [17.092815629040388]
We aim to boost STR models by leveraging both synthetic data and the numerous real unlabeled images.
A character-level consistency regularization is designed to mitigate the misalignment between characters in sequence recognition.
arXiv Detail & Related papers (2022-04-16T04:42:02Z)
- Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z)
- AutoSTR: Efficient Backbone Search for Scene Text Recognition [80.7290173000068]
Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes.
We propose automated STR (AutoSTR) to search data-dependent backbones to boost text recognition performance.
Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks.
arXiv Detail & Related papers (2020-03-14T06:51:04Z)
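As a small illustration of the patch-wise tokenization that SVTR (above) is built on, and that pyramid vision extractors generally start from, the sketch below splits a 2D image grid into non-overlapping patches. The patch size, the row-major ordering, and the even-divisibility requirement are illustrative assumptions, not details from any of the listed papers.

```python
# Hedged sketch: patch-wise image tokenization, as in SVTR's decomposition
# of a text image into small patches ("character components").
# Patch size and flattening order here are illustrative assumptions.

def image_to_patches(img, ph, pw):
    """img: 2D list (H x W) of pixel values.
    Returns a row-major list of ph x pw patches, each a 2D list."""
    h, w = len(img), len(img[0])
    # assumption: image dimensions divide evenly into patches
    assert h % ph == 0 and w % pw == 0, "dims must divide evenly"
    patches = []
    for r in range(0, h, ph):          # walk patch rows top to bottom
        for c in range(0, w, pw):      # then patch columns left to right
            patches.append([row[c:c + pw] for row in img[r:r + ph]])
    return patches
```

In a real recognizer each patch would then be flattened and linearly projected into a token embedding before entering the self-attention stack.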
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.