ReADS: A Rectified Attentional Double Supervised Network for Scene Text Recognition
- URL: http://arxiv.org/abs/2004.02070v2
- Date: Tue, 7 Apr 2020 01:44:17 GMT
- Title: ReADS: A Rectified Attentional Double Supervised Network for Scene Text Recognition
- Authors: Qi Song, Qianyi Jiang, Nan Li, Rui Zhang and Xiaolin Wei
- Abstract summary: We elaborately design a Rectified Attentional Double Supervised Network (ReADS) for general scene text recognition.
The ReADS can be trained end-to-end and only word-level annotations are required.
- Score: 22.367624178280682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, scene text recognition has typically been regarded as a
sequence-to-sequence problem. Connectionist Temporal Classification (CTC) and
attentional sequence recognition (Attn) are two prevailing approaches to this
problem, yet each may fail in certain scenarios. CTC concentrates on every
individual character but is weak at modeling text semantic dependencies.
Attn-based methods have better contextual semantic modeling ability but tend to
overfit on limited training data. In this paper, we elaborately design a
Rectified Attentional Double Supervised Network (ReADS) for general scene text
recognition. To overcome the weaknesses of CTC and Attn, both are applied in
our method, but with different modules in two supervised branches that
complement each other. Moreover, effective spatial and channel attention
mechanisms are introduced to eliminate background noise and extract valid
foreground information. Finally, a simple rectification network is implemented
to rectify irregular text. ReADS can be trained end-to-end, and only word-level
annotations are required. Extensive experiments on various benchmarks verify
the effectiveness of ReADS, which achieves state-of-the-art performance.
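The double-supervision idea above can be sketched in a few lines: a shared encoder feeds a CTC branch and an attention branch, and the two branch losses are combined into one end-to-end training objective so that each branch regularizes the other. The function name and the weighting scheme below are illustrative assumptions, not details taken from the paper.

```python
def double_supervised_loss(ctc_loss: float, attn_loss: float,
                           ctc_weight: float = 0.5) -> float:
    """Combine the CTC-branch and attention-branch losses (hypothetical sketch).

    Both branches share the encoder features and are trained jointly from
    word-level annotations only, so a single weighted sum of the per-branch
    losses suffices as the overall objective.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attn_loss
```

In practice each branch would produce its loss from its own decoder head (e.g. a CTC loss over per-frame logits and a cross-entropy loss over attention-decoded characters); the weighted sum is one simple way to balance them.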
Related papers
- BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection [8.303512060791736]
Spoken term detection is often hindered by reliance on frame-level features and the computationally intensive DTW-based template matching.
We propose a novel approach that encodes speech into discrete, speaker-agnostic semantic tokens.
This facilitates fast retrieval using text-based search algorithms and effectively handles out-of-vocabulary terms.
arXiv Detail & Related papers (2024-11-21T13:05:18Z)
- Double Mixture: Towards Continual Event Detection from Speech [60.33088725100812]
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events.
This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events.
We propose a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.
arXiv Detail & Related papers (2024-04-20T06:32:00Z)
- Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition [56.968108142307976]
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models.
This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
arXiv Detail & Related papers (2024-02-24T13:00:54Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Three ways to improve feature alignment for open vocabulary detection [88.65076922242184]
A key problem in zero-shot open-vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining.
We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training.
Secondly, the feature pyramid network and the detection head are modified to include trainable shortcuts.
Finally, a self-training approach is used to leverage a larger corpus of
arXiv Detail & Related papers (2023-03-23T17:59:53Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer [21.479222207347238]
We introduce TextTranSpotter (TTS), a transformer-based approach for text spotting.
TTS is trained with both fully- and weakly-supervised settings.
When trained in a fully-supervised manner, TextTranSpotter shows state-of-the-art results on multiple benchmarks.
arXiv Detail & Related papers (2022-02-11T08:50:09Z)
- Tag, Copy or Predict: A Unified Weakly-Supervised Learning Framework for Visual Information Extraction using Sequences [27.75850798545413]
We propose a unified weakly-supervised learning framework called TCPN (Tag, Copy or Predict Network).
Our method shows new state-of-the-art performance on several public benchmarks, which fully proves its effectiveness.
arXiv Detail & Related papers (2021-06-20T11:56:46Z)
- Implicit Feature Alignment: Learn to Convert Text Recognizer to Text Spotter [38.4211220941874]
We propose a simple, elegant and effective paradigm called Implicit Feature Alignment (IFA).
IFA can be easily integrated into current text recognizers, resulting in a novel inference mechanism called IFA inference.
We experimentally demonstrate that IFA achieves state-of-the-art performance on end-to-end document recognition tasks.
arXiv Detail & Related papers (2021-06-10T17:06:28Z)
- Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [52.86058031919856]
We propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition.
A global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission.
Results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method.
arXiv Detail & Related papers (2020-03-27T09:19:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.