CDistNet: Perceiving Multi-Domain Character Distance for Robust Text
Recognition
- URL: http://arxiv.org/abs/2111.11011v5
- Date: Sun, 27 Aug 2023 02:55:53 GMT
- Authors: Tianlun Zheng, Zhineng Chen, Shancheng Fang, Hongtao Xie, Yu-Gang
Jiang
- Abstract summary: We propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visually and semantically related position embedding.
MDCDP uses the position embedding to query both visual and semantic features following the cross-attention mechanism.
We develop CDistNet, which stacks multiple MDCDPs to guide increasingly precise distance modeling.
- Score: 87.3894423816705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer-based encoder-decoder framework is becoming popular in scene
text recognition, largely because it naturally integrates recognition clues
from both visual and semantic domains. However, recent studies show that the
two kinds of clues are not always well registered and therefore, feature and
character might be misaligned in difficult text (e.g., with a rare shape). As a
result, constraints such as character position are introduced to alleviate this
problem. Despite certain success, the visual and semantic clues are still
modeled separately and only loosely associated. In this paper, we propose a
novel module called Multi-Domain Character Distance Perception (MDCDP) to
establish a visually and semantically related position embedding. MDCDP uses
the position embedding to query both visual and semantic features following the
cross-attention mechanism. The two kinds of clues are fused into the position
branch, generating a content-aware embedding that perceives character spacing
and orientation variations, character semantic affinities, and clues tying the
two kinds of information together. We summarize these as the multi-domain
character distance. We develop CDistNet, which stacks multiple MDCDPs to guide
increasingly precise distance modeling. Thus, feature-character alignment is
well established even when various recognition difficulties are present. We verify
CDistNet on ten challenging public datasets and two series of augmented
CDistNet on ten challenging public datasets and two series of augmented
datasets created by ourselves. The experiments demonstrate that CDistNet
performs highly competitively. It not only ranks top-tier in standard
benchmarks, but also outperforms recent popular methods by clear margins on
real and augmented datasets presenting severe text deformation, poor linguistic
support, and rare character layouts. Code is available at
https://github.com/simplify23/CDistNet.
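The core mechanism can be sketched roughly as follows. This is a minimal NumPy illustration of the idea described in the abstract (a position embedding querying visual and semantic features via cross-attention, with the retrieved clues fused back into the position branch, and several such blocks stacked); the fusion step, normalization, and block internals are simplified assumptions, not the paper's exact architecture:

```python
import numpy as np

def cross_attention(query, key, value):
    # Scaled dot-product cross-attention: each query row attends over key/value rows.
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ value

def mdcdp_step(pos_emb, vis_feat, sem_feat):
    # The position embedding queries both visual and semantic features;
    # the retrieved clues are fused back into the position branch
    # (fusion is simplified here to a residual sum).
    vis_clue = cross_attention(pos_emb, vis_feat, vis_feat)
    sem_clue = cross_attention(pos_emb, sem_feat, sem_feat)
    return pos_emb + vis_clue + sem_clue

def stack_mdcdp(pos_emb, vis_feat, sem_feat, n_blocks=3):
    # Stacking multiple MDCDP blocks refines the content-aware
    # position embedding step by step.
    for _ in range(n_blocks):
        pos_emb = mdcdp_step(pos_emb, vis_feat, sem_feat)
    return pos_emb
```

The refined position embedding would then serve as the query for character classification, which is where the visually and semantically informed "distance" pays off.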
Related papers
- GatedLexiconNet: A Comprehensive End-to-End Handwritten Paragraph Text Recognition System [3.9527064697847005]
We present an end-to-end paragraph recognition system that incorporates internal line segmentation and a convolution-based encoder.
This study reported character error rates of 2.27% on IAM, 0.9% on RIMES, and 2.13% on READ-2016, and a word error rate of 5.73% on READ-2016.
arXiv Detail & Related papers (2024-04-22T10:19:16Z) - TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features, without requiring any data format other than image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image.
Experiments substantiate an average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z) - Multi-Granularity Cross-Modality Representation Learning for Named
Entity Recognition on Social Media [11.235498285650142]
Named Entity Recognition (NER) on social media refers to discovering and classifying entities from unstructured free-form content.
This work introduces multi-granularity cross-modality representation learning.
Experiments show that our proposed approach achieves SOTA or near-SOTA performance on two tweet benchmark datasets.
arXiv Detail & Related papers (2022-10-19T15:14:55Z) - Integrating Visuospatial, Linguistic and Commonsense Structure into
Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z) - Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text
Recognition [36.12001394921506]
State-of-the-art (SOTA) models still struggle in the wild scenarios due to complex backgrounds, varying fonts, uncontrolled illuminations, distortions and other artefacts.
This is because such models solely depend on visual information for text recognition, thus lacking semantic reasoning capabilities.
We propose a multi-stage multi-scale attentional decoder that performs joint visual-semantic reasoning.
arXiv Detail & Related papers (2021-07-26T10:15:14Z) - Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z) - MANGO: A Mask Attention Guided One-Stage Scene Text Spotter [41.66707532607276]
We propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO.
The proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks.
arXiv Detail & Related papers (2020-12-08T10:47:49Z) - Text Recognition in Real Scenarios with a Few Labeled Samples [55.07859517380136]
Scene text recognition (STR) is still a hot research topic in the computer vision field.
This paper proposes a few-shot adversarial sequence domain adaptation (FASDA) approach to achieve sequence-level domain adaptation.
Our approach can maximize the character-level confusion between the source domain and the target domain.
arXiv Detail & Related papers (2020-06-22T13:03:01Z) - TextScanner: Reading Characters in Order for Robust Scene Text
Recognition [60.04267660533966]
TextScanner is an alternative approach for scene text recognition.
It generates pixel-wise, multi-channel segmentation maps for character class, position and order.
It also adopts an RNN for context modeling and performs parallel prediction of character position and class.
arXiv Detail & Related papers (2019-12-28T07:52:00Z)
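TextScanner's order-aware reading can be illustrated with a toy decoding routine. The sketch below is a hypothetical NumPy reconstruction, not the paper's implementation: given per-class segmentation score maps and a map assigning each pixel a reading-order index, characters are recovered slot by slot by aggregating class scores within each ordered region.

```python
import numpy as np

def read_in_order(class_maps, order_map, max_chars):
    # class_maps: (C, H, W) per-class segmentation scores.
    # order_map:  (H, W) integer reading-order index per pixel (0 = background).
    chars = []
    for k in range(1, max_chars + 1):
        region = order_map == k
        if not region.any():
            continue                                # no pixels in this slot
        votes = class_maps[:, region].sum(axis=1)   # aggregate scores per class
        chars.append(int(votes.argmax()))           # pick the winning class
    return chars
```

Because each reading-order slot is decoded independently, predictions for all character positions can in principle be made in parallel, which is the property the summary above highlights.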
This list is automatically generated from the titles and abstracts of the papers in this site.