Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition
- URL: http://arxiv.org/abs/2208.00438v1
- Date: Sun, 31 Jul 2022 14:11:05 GMT
- Title: Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition
- Authors: Xudong Xie, Ling Fu, Zhifei Zhang, Zhaowen Wang, Xiang Bai
- Abstract summary: We propose to recognize artistic text at three levels.
Firstly, corner points are applied to guide the extraction of local features inside characters, considering the robustness of corner structures to variations in appearance and shape.
Secondly, we design a character contrastive loss to model character-level features, improving the feature representation for character classification.
Thirdly, we utilize a Transformer to learn global image-level features and to model the global relationships among the corner points.
- Score: 63.6608759501803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Artistic text recognition is an extremely challenging task with a wide range
of applications. However, current scene text recognition methods mainly focus
on irregular text and have not explored artistic text specifically. The
challenges of artistic text recognition include the varied appearance produced
by specially designed fonts and effects, the complex connections and overlaps
between characters, and the severe interference from background patterns. To
alleviate these problems, we propose to recognize artistic text at three
levels. Firstly, corner points are applied to guide the extraction of local
features inside characters, considering the robustness of corner structures to
variations in appearance and shape. In this way, the discreteness of the corner
points cuts off the connections between characters, and their sparsity improves
robustness to background interference. Secondly, we design a character
contrastive loss to model character-level features, improving the feature
representation for character classification. Thirdly, we utilize a Transformer
to learn global image-level features and to model the global relationships
among the corner points, with the assistance of a corner-query cross-attention
mechanism. Besides, we provide an artistic text dataset to benchmark
performance. Experimental results verify the significant superiority of our
proposed method on artistic text recognition, and it also achieves
state-of-the-art performance on several blurred and perspective datasets.
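
The paper's corner module is not reproduced in this abstract; as a rough illustration of the first level, the sketch below uses OpenCV's Shi-Tomasi detector (cv2.goodFeaturesToTrack) as a stand-in to turn an input image into the kind of sparse corner map that could guide local feature extraction. The function name and all thresholds are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: a sparse corner map as guidance for local features.
# Shi-Tomasi (cv2.goodFeaturesToTrack) stands in for the paper's corner
# detector; names and parameters here are assumptions, not the released code.
import cv2
import numpy as np
import torch

def extract_corner_map(gray_image: np.ndarray, max_corners: int = 128) -> torch.Tensor:
    """Detect sparse corners and rasterize them into a binary guidance map.

    Per the abstract, the discreteness of corners cuts connections between
    characters and their sparsity suppresses background interference.
    """
    corners = cv2.goodFeaturesToTrack(
        gray_image, maxCorners=max_corners, qualityLevel=0.01, minDistance=3
    )
    corner_map = np.zeros_like(gray_image, dtype=np.float32)
    if corners is not None:
        for x, y in corners.reshape(-1, 2).astype(int):
            corner_map[y, x] = 1.0
    return torch.from_numpy(corner_map)  # (H, W), 1.0 at corner locations
```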
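For the second level, the character contrastive loss is described only at a high level; one plausible reading is a supervised contrastive objective over per-character embeddings, pulling features of the same character class together and pushing different classes apart. The PyTorch sketch below follows the standard supervised-contrastive (SupCon-style) formulation and is an assumption about, not a reproduction of, the paper's loss.

```python
# Hypothetical sketch of a character-level contrastive loss (SupCon-style);
# the paper's exact formulation may differ.
import torch
import torch.nn.functional as F

def character_contrastive_loss(features: torch.Tensor,
                               labels: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """features: (N, D) character embeddings; labels: (N,) character classes."""
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature            # (N, N) similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    logits = sim.masked_fill(self_mask, float('-inf'))     # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)        # avoid -inf * 0
    # average log-likelihood of positive pairs per anchor, then over anchors
    per_anchor = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    has_pos = pos_mask.any(1)                              # anchors with a positive
    return per_anchor[has_pos].mean()
```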
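For the third level, the corner-query cross-attention is named but not specified in this abstract; one natural reading, sketched below purely as an assumption, is a Transformer cross-attention block in which embedded corner points act as queries over flattened image features, letting the model relate sparse corners to global image context. The module and argument names are invented for illustration.

```python
# Hypothetical rendering of the "corner-query cross-attention" named in the
# abstract: corner tokens query flattened image features. Not the authors' code.
import torch
import torch.nn as nn

class CornerQueryCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, corner_tokens: torch.Tensor,
                image_tokens: torch.Tensor) -> torch.Tensor:
        # corner_tokens: (B, num_corners, dim) embedded corner points (queries)
        # image_tokens:  (B, H*W, dim) flattened image features (keys/values)
        attended, _ = self.attn(corner_tokens, image_tokens, image_tokens)
        return self.norm(corner_tokens + attended)  # residual + LayerNorm
```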
Related papers
- VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models [53.59400446543756]
We introduce a dual-branch and training-free method, namely VitaGlyph, to enable flexible artistic typography.
VitaGlyph treats the input character as a scene composed of a Subject and its Surrounding, which are rendered under varying degrees of geometric transformation.
Experimental results demonstrate that VitaGlyph not only achieves better artistry and readability, but also manages to depict multiple customized concepts.
arXiv Detail & Related papers (2024-10-02T16:48:47Z)
- CLII: Visual-Text Inpainting via Cross-Modal Predictive Interaction [23.683636588751753]
State-of-the-art inpainting methods are mainly designed for natural images and cannot correctly recover text within scene text images.
We identify the visual-text inpainting task, which aims at high-quality scene text image restoration and text completion.
arXiv Detail & Related papers (2024-07-23T06:12:19Z)
- Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing [47.421888361871254]
Scene text images contain not only style information (font, background) but also content information (character, texture).
Previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance.
We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability.
arXiv Detail & Related papers (2024-05-07T15:00:11Z)
- Orientation-Independent Chinese Text Recognition in Scene Images [61.34060587461462]
We make the first attempt to extract orientation-independent visual features by disentangling the content and orientation information of text images.
Specifically, we introduce a Character Image Reconstruction Network (CIRN) to recover corresponding printed character images with disentangled content and orientation information.
arXiv Detail & Related papers (2023-09-03T05:30:21Z)
- Deformation Robust Text Spotting with Geometric Prior [5.639053898266709]
We develop a robust text spotting method (DR TextSpotter) to solve the problem of recognizing characters with complex deformations across different fonts.
A graph convolution network is constructed to fuse character features and landmark features, and then performs semantic reasoning to enhance the discrimination between different characters.
arXiv Detail & Related papers (2023-08-31T02:13:15Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z)
- Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks [48.81850740907517]
We present TATSR, a Text-Aware Text Super-Resolution framework.
It effectively learns the unique text characteristics using Criss-Cross Transformer Blocks (CCTBs) and a novel Content Perceptual (CP) Loss.
It outperforms state-of-the-art methods in terms of both recognition accuracy and human perception.
arXiv Detail & Related papers (2022-10-13T11:48:45Z)
- Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)