ReADS: A Rectified Attentional Double Supervised Network for Scene Text Recognition
- URL: http://arxiv.org/abs/2004.02070v2
- Date: Tue, 7 Apr 2020 01:44:17 GMT
- Title: ReADS: A Rectified Attentional Double Supervised Network for Scene Text Recognition
- Authors: Qi Song, Qianyi Jiang, Nan Li, Rui Zhang and Xiaolin Wei
- Abstract summary: We elaborately design a Rectified Attentional Double Supervised Network (ReADS) for general scene text recognition.
The ReADS can be trained end-to-end and only word-level annotations are required.
- Score: 22.367624178280682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, scene text recognition has commonly been regarded
as a sequence-to-sequence problem. Connectionist Temporal Classification (CTC)
and attentional sequence recognition (Attn) are the two prevailing approaches
to this problem, yet each can fail in certain scenarios. CTC concentrates on
every individual character but is weak at modeling semantic dependencies
within the text. Attn-based methods model context semantics better but tend to
overfit on limited training data. In this paper, we elaborately design a
Rectified Attentional Double Supervised Network (ReADS) for general scene text
recognition. To overcome the weaknesses of CTC and Attn, both are applied in
our method, but with different modules in two supervised branches that
complement each other. Moreover, effective spatial and channel attention
mechanisms are introduced to suppress background noise and extract valid
foreground information. Finally, a simple rectification network straightens
irregular text. ReADS can be trained end-to-end, and only word-level
annotations are required. Extensive experiments on various benchmarks verify
the effectiveness of ReADS, which achieves state-of-the-art performance.
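The double-supervision idea maps naturally onto code: a shared encoder feeds both a CTC branch and an attentional decoder branch, and the two losses are combined. Below is a minimal PyTorch sketch of such a head; the module choices, dimensions, and names are illustrative assumptions, not the paper's exact configuration (the rectification network and the spatial/channel attention modules would sit upstream of this head).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleSupervisedHead(nn.Module):
    """Sketch of CTC + attention double supervision over a shared encoder.
    Sizes and module choices are assumptions, not the paper's exact design."""

    def __init__(self, feat_dim=512, hidden=256, num_classes=97):
        super().__init__()
        # Shared sequence encoder over per-column visual features.
        self.encoder = nn.LSTM(feat_dim, hidden, bidirectional=True,
                               batch_first=True)
        # Branch 1: per-frame classifier supervised with CTC (+1 for blank).
        self.ctc_fc = nn.Linear(2 * hidden, num_classes + 1)
        # Branch 2: attentional decoder supervised with cross-entropy.
        self.dec = nn.GRUCell(2 * hidden + num_classes, hidden)
        self.attn = nn.Linear(2 * hidden + hidden, 1)
        self.dec_fc = nn.Linear(hidden, num_classes)

    def forward(self, feats, targets, max_len=25):
        # feats: (B, T, feat_dim) column features from the CNN backbone.
        # targets: (B, max_len) padded ground-truth character ids.
        enc, _ = self.encoder(feats)            # (B, T, 2*hidden)
        ctc_logits = self.ctc_fc(enc)           # (B, T, num_classes + 1)

        B, T, _ = enc.shape
        h = enc.new_zeros(B, self.dec.hidden_size)
        prev = enc.new_zeros(B, self.dec_fc.out_features)
        attn_logits = []
        for t in range(max_len):
            # Additive attention over encoder states.
            score = self.attn(torch.cat(
                [enc, h.unsqueeze(1).expand(-1, T, -1)], dim=-1))
            ctx = (F.softmax(score, dim=1) * enc).sum(dim=1)  # (B, 2*hidden)
            h = self.dec(torch.cat([ctx, prev], dim=-1), h)
            attn_logits.append(self.dec_fc(h))
            # Teacher forcing with ground-truth characters during training.
            prev = F.one_hot(targets[:, t],
                             self.dec_fc.out_features).float()
        return ctc_logits, torch.stack(attn_logits, dim=1)
```

Training would then sum `nn.CTCLoss` on the first branch with cross-entropy on the decoder branch, e.g. `loss = ctc_loss + lam * attn_loss`; at inference either branch, or a fusion of both, can decode the text.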
Related papers
- Double Mixture: Towards Continual Event Detection from Speech [60.33088725100812]
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events.
This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events.
We propose a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.
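The summary leaves the memory mechanism unspecified; one common way to "prevent forgetting" in continual learning is a rehearsal buffer that replays stored examples alongside new events. The sketch below shows that generic technique under the assumption of a simple (features, label) stream; it is not the paper's Double Mixture method.

```python
import random

class RehearsalMemory:
    """Generic replay buffer for continual learning. Illustrative only;
    the 'Double Mixture' memory mechanism is not detailed in the summary."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.buffer = []   # stored (features, label) pairs from past events
        self.seen = 0

    def add(self, example):
        # Reservoir sampling keeps a uniform sample over the whole stream.
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(example)
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = example

    def replay_batch(self, k=32):
        # Mix old examples into each new-task batch to combat forgetting.
        return random.sample(self.buffer, min(k, len(self.buffer)))
```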
arXiv Detail & Related papers (2024-04-20T06:32:00Z)
- Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition [56.968108142307976]
Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training.
Most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models.
This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects.
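Visual consistency regularization is typically implemented as agreement between predictions on two views of the same unlabeled image. The sketch below shows a standard pseudo-labeling variant of that idea; the `(B, L, C)` per-character output shape and the confidence threshold are assumptions, and the paper's actual objective (which also has a semantic branch) may differ.

```python
import torch
import torch.nn.functional as F

def visual_consistency_loss(model, weak_img, strong_img, conf_thresh=0.9):
    """Generic word-level consistency term for semi-supervised recognition.
    `model` is assumed to return per-character logits of shape (B, L, C)."""
    with torch.no_grad():
        probs = model(weak_img).softmax(dim=-1)   # predictions on weak view
        conf, pseudo = probs.max(dim=-1)          # pseudo-labels + confidence
        mask = (conf > conf_thresh).float()       # keep confident chars only

    strong_logits = model(strong_img)             # predictions on strong view
    loss = F.cross_entropy(strong_logits.transpose(1, 2),  # (B, C, L)
                           pseudo, reduction='none')        # (B, L)
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```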
arXiv Detail & Related papers (2024-02-24T13:00:54Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve efficiency for spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Three ways to improve feature alignment for open vocabulary detection [88.65076922242184]
A key problem in zero-shot open-vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes.
Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining.
We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training.
Secondly, the feature pyramid network and the detection head are modified to include trainable shortcuts.
Finally, a self-training approach is used to leverage a larger corpus of image-text pairs.
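The first fix, augmenting the text embeddings, resembles the familiar prompt-ensembling trick from CLIP-style detectors: embed each class name under several templates and average the normalized results, so the detector does not overfit to one phrasing of the seen classes. The templates and the `text_encoder` interface below are assumptions for illustration, not the paper's exact scheme.

```python
import torch

# Hypothetical prompt templates; the paper's augmentation may differ.
TEMPLATES = [
    "a photo of a {}.",
    "a cropped photo of a {}.",
    "a close-up photo of a {}.",
]

def augmented_class_embeddings(text_encoder, class_names):
    """Average one normalized embedding per template for each class.
    text_encoder(list_of_strings) is assumed to return (len(list), D)."""
    embs = []
    for name in class_names:
        e = text_encoder([t.format(name) for t in TEMPLATES])
        e = e / e.norm(dim=-1, keepdim=True)   # unit-normalize each prompt
        embs.append(e.mean(dim=0))             # ensemble over templates
    out = torch.stack(embs)                    # (num_classes, D)
    return out / out.norm(dim=-1, keepdim=True)
```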
arXiv Detail & Related papers (2023-03-23T17:59:53Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
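The "differentiable dynamic programming" formulation can be made concrete with a soft Longest Common Subsequence: the hard `max` in the classic LCS recursion is replaced by a temperature-smoothed log-sum-exp, so the score becomes differentiable in the pairwise similarities. This is a generic soft-DP sketch, not the paper's exact FSD/LCS objectives.

```python
import torch

def soft_lcs(sim, tau=0.1):
    """Soft Longest Common Subsequence score.
    sim: (m, n) tensor of pairwise similarities between two sequences.
    Hard recursion: d[i,j] = max(d[i-1,j-1] + sim[i,j], d[i-1,j], d[i,j-1]);
    here max is smoothed as tau * logsumexp(. / tau) so gradients flow."""
    m, n = sim.shape
    zero = sim.new_zeros(())
    # Table of 0-dim tensors; lists avoid in-place autograd pitfalls.
    d = [[zero] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cands = torch.stack([d[i-1][j-1] + sim[i-1, j-1],
                                 d[i-1][j],
                                 d[i][j-1]])
            d[i][j] = tau * torch.logsumexp(cands / tau, dim=0)
    return d[m][n]   # differentiable w.r.t. sim
```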
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer [21.479222207347238]
We introduce TextTranSpotter (TTS), a transformer-based approach for text spotting.
TTS can be trained in both fully- and weakly-supervised settings.
When trained in a fully-supervised manner, TextTranSpotter shows state-of-the-art results on multiple benchmarks.
arXiv Detail & Related papers (2022-02-11T08:50:09Z)
- Tag, Copy or Predict: A Unified Weakly-Supervised Learning Framework for Visual Information Extraction using Sequences [27.75850798545413]
We propose a unified weakly-supervised learning framework called TCPN (Tag, Copy or Predict Network).
Our method sets new state-of-the-art performance on several public benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2021-06-20T11:56:46Z)
- Implicit Feature Alignment: Learn to Convert Text Recognizer to Text Spotter [38.4211220941874]
We propose a simple, elegant and effective paradigm called Implicit Feature Alignment (IFA).
IFA can be easily integrated into current text recognizers, resulting in a novel inference mechanism called IFA inference.
We experimentally demonstrate that IFA achieves state-of-the-art performance on end-to-end document recognition tasks.
arXiv Detail & Related papers (2021-06-10T17:06:28Z)
- Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [52.86058031919856]
We propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition.
A global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission.
Results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method.
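"Multi-way parallel transmission" contrasts with step-by-step RNN decoding: every position attends to the approximate characters at all other positions in a single pass. Below is a rough sketch of such a parallel reasoning module, with all layer sizes and names as assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ParallelSemanticReasoner(nn.Module):
    """Rough sketch of GSRM-style global semantic reasoning: embed an
    initial (approximate) character sequence and refine it with
    self-attention in one parallel pass, instead of decoding left to right."""

    def __init__(self, num_classes=97, d_model=256, max_len=25):
        super().__init__()
        self.char_emb = nn.Embedding(num_classes, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8,
                                           batch_first=True)
        self.reason = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, coarse_logits):
        # coarse_logits: (B, L, C) first-pass predictions from the visual
        # module. argmax blocks gradients here; a soft variant could
        # multiply probabilities with the embedding matrix instead.
        ids = coarse_logits.argmax(dim=-1)                   # (B, L)
        pos = torch.arange(ids.size(1), device=ids.device)   # (L,)
        x = self.char_emb(ids) + self.pos_emb(pos)           # (B, L, D)
        x = self.reason(x)   # every position attends to all others at once
        return self.fc(x)    # refined per-character logits
```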
arXiv Detail & Related papers (2020-03-27T09:19:25Z)
- SCATTER: Selective Context Attentional Scene Text Recognizer [16.311256552979835]
Scene Text Recognition (STR) is the task of recognizing text against complex image backgrounds.
Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes.
We introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER).
arXiv Detail & Related papers (2020-03-25T09:20:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.