RobustScanner: Dynamically Enhancing Positional Clues for Robust Text
Recognition
- URL: http://arxiv.org/abs/2007.07542v2
- Date: Fri, 17 Jul 2020 07:16:45 GMT
- Title: RobustScanner: Dynamically Enhancing Positional Clues for Robust Text
Recognition
- Authors: Xiaoyu Yue, Zhanghui Kuang, Chenhao Lin, Hongbin Sun, and Wayne Zhang
- Abstract summary: We show that a character-level sequence decoder utilizes not only context information but also positional information.
We propose a novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition.
Our proposed method, dubbed RobustScanner, decodes individual characters with a dynamic ratio between context and positional clues.
- Score: 31.62436356768889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The attention-based encoder-decoder framework has recently achieved
impressive results for scene text recognition, and many variants have emerged
with improvements in recognition quality. However, it performs poorly on
contextless texts (e.g., random character sequences), which is unacceptable in
most real application scenarios. In this paper, we first investigate the
decoder's decoding process in depth. We empirically find that a representative
character-level sequence decoder utilizes not only context information but also
positional information. Contextual information, which the existing approaches
heavily rely on, causes the problem of attention drift. To suppress this side
effect, we propose a novel position enhancement branch and dynamically fuse its
outputs with those of the decoder attention module for scene text recognition.
Specifically, it contains a position-aware module that enables the encoder to
output feature vectors encoding their own spatial positions, and an attention
module that estimates glimpses using only the positional clue (i.e., the
current decoding time step). The dynamic fusion is conducted via an
element-wise gate mechanism to produce a more robust feature. Theoretically,
our proposed method, dubbed RobustScanner, decodes individual characters with a
dynamic ratio between contextual and positional clues, relying more on the
positional ones when decoding sequences with scarce context, and is thus robust
and practical. Empirically, it achieves new state-of-the-art results on popular
regular and irregular text recognition benchmarks without much performance
drop on contextless benchmarks, validating its robustness in both contextual
and contextless application scenarios.
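
The gated fusion described in the abstract can be sketched in a few lines. The following is a minimal illustration rather than the authors' implementation: the module name, dimensions, and the hybrid/position glimpse inputs are assumptions; only the element-wise gate that mixes context-driven and position-driven glimpses is shown.

```python
import torch
import torch.nn as nn

class DynamicFusionGate(nn.Module):
    """Element-wise gate mixing a context-driven glimpse with a
    position-driven glimpse, per the RobustScanner abstract.
    Minimal sketch; layer names and sizes are assumptions."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate is computed from the concatenation of both glimpses.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, hybrid_glimpse: torch.Tensor,
                position_glimpse: torch.Tensor) -> torch.Tensor:
        # hybrid_glimpse:   output of the (context-using) decoder attention
        # position_glimpse: output of the position enhancement branch,
        #                   queried only by the current decoding time step
        g = torch.sigmoid(self.gate(
            torch.cat([hybrid_glimpse, position_glimpse], dim=-1)))
        # Per-channel dynamic ratio between context and positional clues:
        # g -> 1 relies on context, g -> 0 relies on position.
        return g * hybrid_glimpse + (1.0 - g) * position_glimpse

# Usage: fuse two 512-d glimpses for a batch of 4 decoding steps.
fuse = DynamicFusionGate(dim=512)
fused = fuse(torch.randn(4, 512), torch.randn(4, 512))
print(fused.shape)  # torch.Size([4, 512])
```

When the gate saturates toward 0, decoding is driven almost entirely by positional information, which is what the abstract argues makes contextless strings tractable.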
Related papers
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
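
One plausible reading of the mechanism above: rather than training against one-hot targets, build target distributions from a large text corpus and train against those. The sketch below illustrates this idea under that assumption; `soft_targets` and the uniform corpus prior are hypothetical stand-ins, not the paper's exact construction.

```python
import torch
import torch.nn.functional as F

def soft_targets(labels: torch.Tensor, prior: torch.Tensor,
                 alpha: float = 0.9) -> torch.Tensor:
    """Replace one-hot targets with a mix of the one-hot label and a
    corpus-derived prior over the vocabulary (hypothetical construction)."""
    one_hot = F.one_hot(labels, num_classes=prior.numel()).float()
    return alpha * one_hot + (1.0 - alpha) * prior  # rows still sum to 1

# Hypothetical character prior estimated from a large text corpus.
vocab_size = 37                      # e.g. 26 letters + 10 digits + EOS
prior = torch.full((vocab_size,), 1.0 / vocab_size)

logits = torch.randn(8, vocab_size)          # recognizer outputs
labels = torch.randint(0, vocab_size, (8,))  # ground-truth characters
targets = soft_targets(labels, prior)

# Cross entropy against the soft distribution instead of one-hot labels.
loss = -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
print(float(loss))
```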
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- IPAD: Iterative, Parallel, and Diffusion-based Network for Scene Text Recognition [5.525052547053668]
Scene text recognition has attracted increasing attention due to its diverse applications.
Most state-of-the-art methods adopt an encoder-decoder framework with the attention mechanism, autoregressively generating text from left to right.
We propose an alternative solution, using a parallel and iterative decoder that adopts an easy-first decoding strategy.
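
The easy-first strategy can be illustrated independently of IPAD's architecture: predict all positions in parallel, commit the most confident ones each round, and re-predict the rest. The loop below is a generic sketch with a stand-in `model`; names and shapes are assumptions, not IPAD's actual network.

```python
import torch

def easy_first_decode(model, memory, max_len=25, vocab=37,
                      mask_id=0, iterations=3):
    """Generic easy-first iterative decoding: each round, keep the
    most confident predictions and re-predict the still-masked slots.
    `model(tokens, memory)` -> logits of shape (1, max_len, vocab)."""
    tokens = torch.full((1, max_len), mask_id)      # start fully masked
    committed = torch.zeros(1, max_len, dtype=torch.bool)
    per_round = max_len // iterations + 1
    for _ in range(iterations):
        logits = model(tokens, memory)
        conf, pred = logits.softmax(-1).max(-1)
        conf = conf.masked_fill(committed, -1.0)    # skip already-fixed slots
        k = min(per_round, int((~committed).sum()))
        idx = conf.topk(k, dim=-1).indices          # easiest remaining slots
        committed.scatter_(1, idx, True)
        tokens = torch.where(committed, pred, tokens)
    return tokens

# Usage with a stand-in "model" (random logits) and dummy image memory.
fake_model = lambda toks, mem: torch.randn(1, toks.size(1), 37)
out = easy_first_decode(fake_model, memory=torch.randn(1, 64, 256))
print(out.shape)  # torch.Size([1, 25])
```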
arXiv Detail & Related papers (2023-12-19T08:03:19Z)
- Co-Speech Gesture Detection through Multi-Phase Sequence Labeling [3.924524252255593]
We introduce a novel framework that reframes the task as a multi-phase sequence labeling problem.
We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues.
arXiv Detail & Related papers (2023-08-21T12:27:18Z)
- ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer [88.61312640540902]
We introduce the Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter).
Our model achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder.
Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2023-08-20T03:22:23Z)
- DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting [129.73247700864385]
DeepSolo is a simple detection transformer baseline in which a single decoder with explicit points performs text detection and recognition simultaneously.
We introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training.
arXiv Detail & Related papers (2022-11-19T19:06:22Z)
- Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition [10.496558786568672]
We propose a Representation and Correlation Enhanced Encoder-Decoder Framework (RCEED) to address these deficiencies and break the performance bottleneck.
In the encoder module, local visual feature, global context feature, and position information are aligned and fused to generate a small-size comprehensive feature map.
In the decoder module, two methods are utilized to enhance the correlation between scene and text feature space.
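
The encoder-side fusion sketched below is one generic way to realize the sentence above: align local visual features, a global context vector, and positional information, then project them into a single comprehensive feature map. It is an assumption-laden illustration, not RCEED's architecture.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Generic sketch: align local visual features, a global context
    vector, and positional encodings, then fuse them into a single
    comprehensive feature map (sizes and ops are assumptions)."""
    def __init__(self, dim=256, height=8, width=32):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, dim, height, width))  # learned positions
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)           # project back to dim

    def forward(self, local_feat):                 # (B, dim, H, W) from a CNN
        B, C, H, W = local_feat.shape
        global_feat = local_feat.mean(dim=(2, 3))  # (B, dim) global context
        global_map = global_feat[:, :, None, None].expand(B, C, H, W)
        x = local_feat + self.pos                  # add positional information
        x = torch.cat([x, global_map], dim=1)      # (B, 2*dim, H, W)
        return self.fuse(x)                        # (B, dim, H, W) fused map

fusion = FeatureFusion()
out = fusion(torch.randn(2, 256, 8, 32))
print(out.shape)  # torch.Size([2, 256, 8, 32])
```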
arXiv Detail & Related papers (2021-06-13T10:36:56Z)
- MANGO: A Mask Attention Guided One-Stage Scene Text Spotter [41.66707532607276]
We propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO.
The proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks.
arXiv Detail & Related papers (2020-12-08T10:47:49Z)
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training sequence encoders.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
- SEED: Semantics Enhanced Encoder-Decoder Framework for Scene Text Recognition [17.191496890376197]
We propose a semantics enhanced encoder-decoder framework to robustly recognize low-quality scene texts.
The proposed framework is more robust for low-quality text images, and achieves state-of-the-art results on several benchmark datasets.
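
SEED's core mechanism, as we understand it, is to predict a holistic semantic vector from the encoder and supervise it with a pre-trained word embedding (e.g., FastText), so the decoder receives global semantics even for degraded images. The sketch below assumes this reading; layer sizes and the loss are illustrative.

```python
import torch
import torch.nn as nn

class SemanticHead(nn.Module):
    """Predicts a global semantic vector from pooled encoder features.
    Illustrative sketch of the SEED idea; sizes are assumptions."""
    def __init__(self, feat_dim=512, sem_dim=300):  # 300 = word-embedding size
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                  nn.Linear(feat_dim, sem_dim))

    def forward(self, enc_feats):                 # (B, T, feat_dim)
        return self.proj(enc_feats.mean(dim=1))   # (B, sem_dim)

head = SemanticHead()
enc_feats = torch.randn(2, 64, 512)              # dummy encoder output
sem = head(enc_feats)

# Supervise against a pre-trained word embedding of the ground-truth word
# (here a random stand-in); `sem` would also condition the decoder state.
word_emb = torch.randn(2, 300)
sem_loss = nn.functional.mse_loss(sem, word_emb)
print(sem.shape, float(sem_loss))
```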
arXiv Detail & Related papers (2020-05-22T03:02:46Z)
- TextScanner: Reading Characters in Order for Robust Scene Text Recognition [60.04267660533966]
TextScanner is an alternative approach for scene text recognition.
It generates pixel-wise, multi-channel segmentation maps for character class, position and order.
It also adopts an RNN for context modeling and performs parallel prediction of character position and class.
arXiv Detail & Related papers (2019-12-28T07:52:00Z)
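
TextScanner's aggregation step can be sketched as follows: given a per-pixel character-class map and one order map per reading position, the k-th character is obtained by using the k-th order map as spatial weights over the class map. Shapes and normalization below are assumptions; only this word-formation step is shown.

```python
import torch

def word_formation(class_map: torch.Tensor,
                   order_maps: torch.Tensor) -> torch.Tensor:
    """Aggregate pixel-wise predictions into a character sequence.
    class_map:  (C, H, W) softmax scores over character classes per pixel.
    order_maps: (K, H, W) spatial attention for each reading position k.
    Returns (K,) predicted class index per position. Sketch of a
    TextScanner-style word formation; shapes/normalization assumed."""
    # Normalize each order map so it sums to 1 over the image.
    w = order_maps / order_maps.flatten(1).sum(-1).clamp_min(1e-6)[:, None, None]
    # Score of class c at position k: sum_xy w_k(x,y) * class_map_c(x,y).
    scores = torch.einsum('khw,chw->kc', w, class_map)
    return scores.argmax(dim=-1)

# Dummy maps: 37 classes, up to 25 characters, 8x32 feature map.
class_map = torch.softmax(torch.randn(37, 8, 32), dim=0)
order_maps = torch.rand(25, 8, 32)
print(word_formation(class_map, order_maps).shape)  # torch.Size([25])
```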
This list is automatically generated from the titles and abstracts of the papers on this site.