ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for
Scene Text Spotting
- URL: http://arxiv.org/abs/2211.10578v1
- Date: Sat, 19 Nov 2022 03:50:33 GMT
- Title: ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for
Scene Text Spotting
- Authors: Shancheng Fang, Zhendong Mao, Hongtao Xie, Yuxin Wang, Chenggang Yan,
Yongdong Zhang
- Abstract summary: We argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input.
We propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting.
- Score: 121.11880210592497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene text spotting is of great importance to the computer vision community
due to its wide variety of applications. Recent methods attempt to introduce
linguistic knowledge for challenging recognition rather than pure visual
classification. However, how to effectively model the linguistic rules in
end-to-end deep networks remains a research challenge. In this paper, we argue
that the limited capacity of language models comes from 1) implicit language
modeling; 2) unidirectional feature representation; and 3) language model with
noise input. Correspondingly, we propose an autonomous, bidirectional and
iterative ABINet++ for scene text spotting. Firstly, the autonomous principle
suggests enforcing explicit language modeling by decoupling the recognizer into
a vision model and a language model and blocking gradient flow between the two
models.
Secondly, a novel bidirectional cloze network (BCN) as the language model is
proposed based on bidirectional feature representation. Thirdly, we propose an
execution manner of iterative correction for the language model which can
effectively alleviate the impact of noise input. Finally, to polish ABINet++ in
long text recognition, we propose to aggregate horizontal features by embedding
Transformer units inside a U-Net, and design a position and content attention
module which integrates character order and content to attend to character
features precisely. ABINet++ achieves state-of-the-art performance on both
scene text recognition and scene text spotting benchmarks, which consistently
demonstrates the superiority of our method in various environments, especially
on low-quality images. Besides, extensive experiments in both English and
Chinese also prove that a text spotter incorporating our language modeling
method can significantly improve its performance in both accuracy and speed
compared with commonly used attention-based recognizers.
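The abstract describes two mechanisms concretely enough to sketch: the "autonomous" principle (the recognizer is decoupled into a vision model and a language model, with gradient flow blocked between them) and the iterative correction loop that lets the language model clean up noisy visual predictions. Below is a minimal, illustrative PyTorch-style sketch of those two ideas only; it is not the authors' implementation, and the module sizes, the stand-in Transformer encoder used in place of the bidirectional cloze network (BCN), and the additive fusion step are assumptions made for readability.

```python
# Minimal sketch (not the authors' code) of the "autonomous" and "iterative"
# principles from the abstract: vision and language models are decoupled,
# gradients are blocked between them, and the language model is applied
# iteratively to refine noisy predictions.
import torch
import torch.nn as nn


class VisionModel(nn.Module):
    """Stand-in visual recognizer: image features -> per-character logits."""
    def __init__(self, feat_dim=512, num_classes=37):
        super().__init__()
        self.backbone = nn.Linear(feat_dim, feat_dim)   # placeholder for a real CNN/Transformer
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, image_feats):                     # (B, T, feat_dim)
        h = torch.relu(self.backbone(image_feats))
        return self.classifier(h)                       # (B, T, num_classes)


class LanguageModel(nn.Module):
    """Stand-in for the bidirectional cloze network (BCN): refines a sequence
    of character probabilities using sequence context only."""
    def __init__(self, num_classes=37, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(num_classes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, num_classes)

    def forward(self, char_probs):                      # (B, T, num_classes)
        h = self.encoder(self.embed(char_probs))
        return self.out(h)                              # refined logits


def spot_text(vision, language, image_feats, num_iters=3):
    """Autonomous + iterative decoding: the language model only ever sees
    probabilities (gradients are blocked with .detach()), and its output is
    fed back for several correction rounds to suppress noisy vision input."""
    vis_logits = vision(image_feats)
    fused_logits = vis_logits
    for _ in range(num_iters):
        probs = fused_logits.softmax(dim=-1).detach()   # block gradient flow into the LM input
        lm_logits = language(probs)
        fused_logits = vis_logits + lm_logits           # illustrative additive fusion
    return fused_logits


if __name__ == "__main__":
    vision, language = VisionModel(), LanguageModel()
    dummy_feats = torch.randn(2, 26, 512)               # batch of 2, 26 character slots
    print(spot_text(vision, language, dummy_feats).shape)  # torch.Size([2, 26, 37])
```

The sketch keeps only the gradient-blocking and iteration structure; in the paper the vision and language features are combined through a learned fusion step rather than the simple addition used here.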
Related papers
- Autoregressive Pre-Training on Pixels and Texts [35.82610192457444]
We explore the dual modality of language (both visual and textual) within an autoregressive framework, pre-trained on both document images and texts.
Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head.
We find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks.
arXiv Detail & Related papers (2024-04-16T16:36:50Z)
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns image-to-language and language-to-image context by reconstructing missing image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
- Bidirectional Representations for Low Resource Spoken Language Understanding [39.208462511430554]
We propose a representation model to encode speech in bidirectional rich encodings.
The approach uses a masked language modelling objective to learn the representations.
We show that the performance of the resulting encodings is better than comparable models on multiple datasets.
arXiv Detail & Related papers (2022-11-24T17:05:16Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)