Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition
- URL: http://arxiv.org/abs/2305.05140v2
- Date: Wed, 10 May 2023 12:55:57 GMT
- Title: Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition
- Authors: Boqiang Zhang, Hongtao Xie, Yuxin Wang, Jianjun Xu, Yongdong Zhang
- Abstract summary: Vision models have gained increasing attention due to their simplicity and efficiency in the Scene Text Recognition (STR) task.
Recent vision models suffer from two problems: (1) the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized as the linguistic insensitive drift (LID) problem in this paper; (2) the visual feature is suboptimal for recognition in vision-missing cases (e.g., occlusion).
We propose a $\textbf{L}$inguistic $\textbf{P}$erception $\textbf{V}$ision model (LPV), which explores the linguistic capability of the vision model for accurate text recognition.
- Score: 92.6211155264297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision models have gained increasing attention due to their simplicity and
efficiency in the Scene Text Recognition (STR) task. However, because they lack the
perception of linguistic knowledge and information, recent vision models suffer
from two problems: (1) the pure vision-based query results in attention drift,
which usually causes poor recognition and is summarized as the linguistic
insensitive drift (LID) problem in this paper; (2) the visual feature is
suboptimal for recognition in some vision-missing cases (e.g., occlusion). To
address these issues, we propose a $\textbf{L}$inguistic
$\textbf{P}$erception $\textbf{V}$ision model (LPV), which explores the
linguistic capability of the vision model for accurate text recognition. To
alleviate the LID problem, we introduce a Cascade Position Attention (CPA)
mechanism that obtains high-quality and accurate attention maps through
step-wise optimization and linguistic information mining. Furthermore, a Global
Linguistic Reconstruction Module (GLRM) is proposed to improve the
representation of visual features by perceiving the linguistic information in
the visual space, which gradually converts visual features into semantically
rich ones during the cascade process. Different from previous methods, our
method obtains SOTA results while keeping low complexity (92.4% accuracy with
only 8.11M parameters). Code is available at
https://github.com/CyrilSterling/LPV.
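As a rough illustration of the cascade idea described in the abstract, the following PyTorch-style sketch wires a position-attention stage and a linguistic reconstruction stage into a small cascade. All module names, shapes, stage counts, and hyperparameters here are illustrative assumptions rather than the authors' implementation; the official LPV code is available at https://github.com/CyrilSterling/LPV.

```python
# Hypothetical sketch of a cascaded position-attention decoder with a global
# linguistic reconstruction stage. Names, shapes, and depths are assumptions
# for illustration only, not the official LPV implementation.
import torch
import torch.nn as nn


class PositionAttention(nn.Module):
    """Attends over visual features with learnable position queries."""

    def __init__(self, dim: int, max_len: int = 25):
        super().__init__()
        self.pos_queries = nn.Parameter(torch.randn(max_len, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, visual_feats: torch.Tensor, prev_queries=None):
        # Cascade step: refine the position queries with the (linguistically
        # enriched) queries produced by the previous stage, if any.
        b = visual_feats.size(0)
        q = self.pos_queries.unsqueeze(0).expand(b, -1, -1)
        if prev_queries is not None:
            q = q + prev_queries
        out, attn_map = self.attn(q, visual_feats, visual_feats)
        return out, attn_map


class GlobalLinguisticReconstruction(nn.Module):
    """Mixes character-level context back into the visual feature space."""

    def __init__(self, dim: int):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, char_feats: torch.Tensor) -> torch.Tensor:
        return self.encoder(char_feats)


class CascadeDecoder(nn.Module):
    """Two-stage cascade: attend, reconstruct, attend again, classify."""

    def __init__(self, dim: int = 256, num_classes: int = 37, stages: int = 2):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.ModuleDict({
                "cpa": PositionAttention(dim),
                "glrm": GlobalLinguisticReconstruction(dim),
            }) for _ in range(stages)
        )
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        queries = None
        for stage in self.stages:
            char_feats, _ = stage["cpa"](visual_feats, queries)
            queries = stage["glrm"](char_feats)
        return self.classifier(queries)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 256)  # (batch, H*W visual tokens, channels)
    logits = CascadeDecoder()(feats)
    print(logits.shape)  # torch.Size([2, 25, 37])
```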
Related papers
- Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks.
LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension.
We propose LACING to address the language bias of LVLMs with a muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG).
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
- Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding [14.701135083174918]
Large Vision-Language Models (LVLMs) generate detailed and coherent responses from visual inputs.
They are prone to generate hallucinations due to an over-reliance on language priors.
We propose a novel method, Summary-Guided Decoding (SGD).
arXiv Detail & Related papers (2024-10-17T08:24:27Z)
- Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios.
The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors.
We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z)
- Exploring Part-Informed Visual-Language Learning for Person Re-Identification [40.725052076983516]
We propose to enhance fine-grained visual features with part-informed language supervision for visual-based person re-identification tasks.
Our $\pi$-VL achieves substantial improvements over previous state-of-the-art methods on four commonly used ReID benchmarks.
arXiv Detail & Related papers (2023-08-04T23:13:49Z)
- DesCo: Learning Object Recognition with Rich Language Descriptions [93.8177229428617]
Recent development in vision-language approaches has instigated a paradigm shift in learning visual recognition models from language supervision.
We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
arXiv Detail & Related papers (2023-06-24T21:05:02Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting [121.11880210592497]
We argue that the limited capacity of language models comes from: 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noise input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
arXiv Detail & Related papers (2022-11-19T03:50:33Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.