IterVM: Iterative Vision Modeling Module for Scene Text Recognition
- URL: http://arxiv.org/abs/2204.02630v1
- Date: Wed, 6 Apr 2022 07:19:28 GMT
- Title: IterVM: Iterative Vision Modeling Module for Scene Text Recognition
- Authors: Xiaojie Chu and Yongtao Wang
- Abstract summary: Scene text recognition (STR) is a challenging problem due to imperfect imagery conditions in natural images.
We propose an iterative vision modeling module (IterVM) to further improve STR accuracy.
IterVM can significantly improve the scene text recognition accuracy, especially on low-quality scene text images.
- Score: 10.417738567452947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition (STR) is a challenging problem due to the imperfect
imagery conditions in natural images. State-of-the-art methods utilize both
visual cues and linguistic knowledge to tackle this challenging problem.
Specifically, they propose an iterative language modeling module (IterLM) to
repeatedly refine the output sequence from the visual modeling module (VM).
Though achieving promising results, the vision modeling module has become the
performance bottleneck of these methods. In this paper, we propose a new
iterative vision modeling module (IterVM) to further improve STR accuracy.
Specifically, the first VM directly extracts multi-level features from the
input image, and the following VMs re-extract multi-level features from the
input image and fuse them with the high-level (i.e., the most semantic one)
feature extracted by the previous VM. By combining the proposed IterVM with
iterative language modeling module, we further propose a powerful scene text
recognizer called IterNet. Extensive experiments demonstrate that the proposed
IterVM can significantly improve the scene text recognition accuracy,
especially on low-quality scene text images. Moreover, the proposed scene text
recognizer IterNet achieves new state-of-the-art results on several public
benchmarks. Code will be available at https://github.com/VDIGPKU/IterNet.
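Read as a forward pass, the iteration described in the abstract amounts to re-running a vision backbone several times, with each run re-extracting multi-level features from the image and fusing in the previous run's most semantic feature. Below is a minimal PyTorch-style sketch of that idea; the names (`VisionModule`, `itervm_forward`), the 1x1-convolution fusion, and the assumption that the backbone returns a list of feature maps are illustrative guesses, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionModule(nn.Module):
    """One VM iteration: a backbone yielding multi-level features, plus a
    fusion step that injects the previous iteration's high-level feature.
    The 1x1-conv fusion here is an assumption, not the paper's exact design."""

    def __init__(self, backbone: nn.Module, high_level_channels: int):
        super().__init__()
        self.backbone = backbone  # assumed to return feature maps ordered low-level -> high-level
        self.fuse = nn.Conv2d(2 * high_level_channels, high_level_channels, kernel_size=1)

    def forward(self, image, prev_high_level=None):
        feats = list(self.backbone(image))  # multi-level features re-extracted from the image
        if prev_high_level is not None:
            # Fuse the most semantic (highest-level) feature from the previous VM.
            prev = F.interpolate(prev_high_level, size=feats[-1].shape[-2:],
                                 mode="bilinear", align_corners=False)
            feats[-1] = self.fuse(torch.cat([feats[-1], prev], dim=1))
        return feats


def itervm_forward(vms, image):
    """Run the VMs iteratively: the first VM sees only the image; each later VM
    re-extracts features and fuses in the previous VM's high-level feature."""
    high_level, feats = None, None
    for vm in vms:
        feats = vm(image, prev_high_level=high_level)
        high_level = feats[-1]
    return feats  # final multi-level features for the recognition head
```

In the full IterNet recognizer, the output of the last VM would then feed the recognition head and the iterative language modeling module (IterLM), per the abstract; the exact interface between the two is not specified here.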
Related papers
- Attention Prompting on Image for Large Vision-Language Models [63.794304207664176]
We propose a new prompting technique named Attention Prompting on Image.
We generate an attention heatmap for the input image dependent on the text query with an auxiliary model like CLIP.
Experiments on various vision-language benchmarks verify the effectiveness of our technique.
arXiv Detail & Related papers (2024-09-25T17:59:13Z)
- AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions.
We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images.
Our model is capable of processing images with resolutions up to $1008 \times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing the translation in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- OVMR: Open-Vocabulary Recognition with Multi-Modal References [96.21248144937627]
Existing works have proposed different methods to embed category cues into the model, e.g., through few-shot fine-tuning.
This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images.
The proposed OVMR is a plug-and-play module, and works well with exemplar images randomly crawled from the Internet.
arXiv Detail & Related papers (2024-06-07T06:45:28Z)
- Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve a target image from a database given a query that combines a reference image with a textual modification.
Recent research sidesteps the need for task-specific supervised training by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL).
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- OSIC: A New One-Stage Image Captioner Coined [38.46732302316068]
We propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning.
To obtain rich features, we use the Swin Transformer to calculate multi-level features.
To enhance the encoder's global modeling for captioning, we propose a new dual-dimensional refining module.
arXiv Detail & Related papers (2022-11-04T08:50:09Z)
- MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding [40.24656027709833]
We propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query.
We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model.
Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR.
arXiv Detail & Related papers (2021-04-26T17:55:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.