VIPTR: A Vision Permutable Extractor for Fast and Efficient Scene Text
Recognition
- URL: http://arxiv.org/abs/2401.10110v3
- Date: Wed, 24 Jan 2024 03:05:53 GMT
- Title: VIPTR: A Vision Permutable Extractor for Fast and Efficient Scene Text
Recognition
- Authors: Xianfu Cheng, Weixiao Zhou, Xiang Li, Xiaoming Chen, Jian Yang,
Tongliang Li, Zhoujun Li
- Abstract summary: Scene Text Recognition (STR) is a challenging task that involves recognizing text within images of natural scenes.
We propose the VIsion Permutable extractor for fast and efficient scene Text Recognition (VIPTR).
VIPTR achieves an impressive balance between high performance and rapid inference speeds in the domain of STR.
- Score: 32.12388950990217
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene Text Recognition (STR) is a challenging task that involves recognizing
text within images of natural scenes. Although current state-of-the-art models
for STR exhibit high performance, they typically suffer from low inference
efficiency due to their reliance on hybrid architectures composed of visual
encoders and sequence decoders. In this work, we propose the VIsion Permutable
extractor for fast and efficient scene Text Recognition (VIPTR), which achieves
an impressive balance between high performance and rapid inference speeds in
the domain of STR. Specifically, VIPTR leverages a visual-semantic extractor
with a pyramid structure, characterized by multiple self-attention layers,
while eschewing the traditional sequence decoder. This design choice results in
a lightweight and efficient model capable of handling inputs of varying sizes.
Extensive experimental results on various standard datasets for both Chinese
and English scene text recognition validate the superiority of VIPTR. Notably,
the VIPTR-T (Tiny) variant delivers highly competitive accuracy on par with
other lightweight models and achieves SOTA inference speeds. Meanwhile, the
VIPTR-L (Large) variant attains greater recognition accuracy, while maintaining
a low parameter count and favorable inference speed. Our proposed method
provides a compelling solution for the STR challenge, which blends high
accuracy with efficiency and greatly benefits real-world applications requiring
fast and reliable text recognition. The code is publicly available at
https://github.com/cxfyxl/VIPTR.
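The abstract describes an encoder-only design that drops the sequence decoder. It does not name the output decoding scheme, but decoder-free recognizers in this line of work (e.g., the SVTR family listed below) commonly emit per-frame character scores that are collapsed into a string by CTC-style greedy decoding. The sketch below illustrates only that collapsing step; the blank index and charset are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: CTC-style greedy decoding, a common way decoder-free STR
# models turn per-frame class scores into text without a sequence decoder.
# The abstract does not specify VIPTR's exact scheme; BLANK and CHARSET
# below are illustrative assumptions.

BLANK = 0                                  # index reserved for the CTC blank
CHARSET = "_abcdefghijklmnopqrstuvwxyz"    # index 0 is the blank placeholder

def ctc_greedy_decode(frame_logits):
    """frame_logits: list of per-frame score lists over CHARSET."""
    # 1) pick the highest-scoring class per frame
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]
    out, prev = [], None
    # 2) collapse consecutive repeats, 3) drop blanks
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(CHARSET[idx])
        prev = idx
    return "".join(out)
```

Note that a blank frame between two identical characters keeps them distinct (e.g., frames `c, blank, c` decode to "cc"), which is the standard CTC collapsing rule.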
Related papers
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts seriously degrade the recognition accuracy of traditional OCR models.
We propose DLoRA-TrOCR, a parameter-efficient mixed-text recognition method based on a pre-trained OCR Transformer.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
- Instruction-Guided Scene Text Recognition [51.853730414264625]
We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem.
We develop a lightweight instruction encoder, a cross-modal feature fusion module, and a multi-task answer head, which together guide nuanced text-image understanding.
IGTR outperforms existing models by significant margins, while maintaining a small model size and efficient inference speed.
arXiv Detail & Related papers (2024-01-31T14:13:01Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition methods have struggled to attain high accuracy and fast inference speed.
We present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
arXiv Detail & Related papers (2023-07-23T09:04:13Z)
- Geometric Perception based Efficient Text Recognition [0.0]
In real-world applications with fixed camera positions, the underlying data tends to be regular scene text.
This paper introduces the underlying concepts, theory, implementation, and experiment results to develop specialized models.
We introduce a novel deep learning architecture (GeoTRNet), trained to identify digits in a regular scene image, only using the geometrical features present.
arXiv Detail & Related papers (2023-02-08T04:19:24Z)
- YORO -- Lightweight End to End Visual Grounding [58.17659561501071]
YORO is a multi-modal transformer encoder-only architecture for the Visual Grounding (VG) task.
It consumes natural language queries, image patches, and learnable detection tokens and predicts coordinates of the referred object.
YORO is shown to support real-time inference and outperform all approaches in this class (single-stage methods) by large margins.
arXiv Detail & Related papers (2022-11-15T05:34:40Z)
- SVTR: Scene Text Recognition with a Single Visual Model [44.26135584093631]
We propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework.
The method, termed SVTR, first decomposes a text image into small patches termed character components.
Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR.
arXiv Detail & Related papers (2022-04-30T04:37:01Z)
- Pushing the Performance Limit of Scene Text Recognizer without Human Annotation [17.092815629040388]
We aim to boost STR models by leveraging both synthetic data and the numerous real unlabeled images.
A character-level consistency regularization is designed to mitigate the misalignment between characters in sequence recognition.
arXiv Detail & Related papers (2022-04-16T04:42:02Z)
- Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z)
- AutoSTR: Efficient Backbone Search for Scene Text Recognition [80.7290173000068]
Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes.
We propose automated STR (AutoSTR) to search data-dependent backbones to boost text recognition performance.
Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks.
arXiv Detail & Related papers (2020-03-14T06:51:04Z)
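As a small illustration of the patch-wise tokenization that SVTR (above) is built on, and that pyramid vision extractors generally start from, the sketch below splits a 2D image grid into non-overlapping patches. The patch size, the row-major ordering, and the even-divisibility requirement are illustrative assumptions, not details from any of the listed papers.

```python
# Hedged sketch: patch-wise image tokenization, as in SVTR's decomposition
# of a text image into small patches ("character components").
# Patch size and flattening order here are illustrative assumptions.

def image_to_patches(img, ph, pw):
    """img: 2D list (H x W) of pixel values.
    Returns a row-major list of ph x pw patches, each a 2D list."""
    h, w = len(img), len(img[0])
    # assumption: image dimensions divide evenly into patches
    assert h % ph == 0 and w % pw == 0, "dims must divide evenly"
    patches = []
    for r in range(0, h, ph):          # walk patch rows top to bottom
        for c in range(0, w, pw):      # then patch columns left to right
            patches.append([row[c:c + pw] for row in img[r:r + ph]])
    return patches
```

In a real recognizer each patch would then be flattened and linearly projected into a token embedding before entering the self-attention stack.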
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.