Text Gestalt: Stroke-Aware Scene Text Image Super-Resolution
- URL: http://arxiv.org/abs/2112.08171v1
- Date: Mon, 13 Dec 2021 15:26:10 GMT
- Title: Text Gestalt: Stroke-Aware Scene Text Image Super-Resolution
- Authors: Jingye Chen, Haiyang Yu, Jianqi Ma, Bin Li, Xiangyang Xue
- Abstract summary: We propose a Stroke-Aware Scene Text Image Super-Resolution method containing a Stroke-Focused Module (SFM) to concentrate on stroke-level internal structures of characters in text images.
Specifically, we design rules for decomposing English characters and digits at the stroke level, then pre-train a text recognizer to provide stroke-level attention maps as positional clues.
The proposed method generates more distinguishable images on TextZoom and on the manually constructed Chinese character dataset Degraded-IC13.
- Score: 31.88960656995447
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the last decade, the rise of deep learning has driven rapid
progress in scene text recognition. However, recognizing low-resolution scene
text images remains a challenge. Although several super-resolution methods have
been proposed to tackle this problem, they usually treat text images as generic
images, ignoring the fact that the visual quality of strokes (the atomic units
of text) plays an essential role in text recognition. According to Gestalt
psychology, humans are capable of
composing parts of details into the most similar objects guided by prior
knowledge. Likewise, when humans observe a low-resolution text image, they will
inherently use partial stroke-level details to recover the appearance of
holistic characters. Inspired by Gestalt Psychology, we put forward a
Stroke-Aware Scene Text Image Super-Resolution method containing a
Stroke-Focused Module (SFM) to concentrate on stroke-level internal structures
of characters in text images. Specifically, we design rules for decomposing
English characters and digits at the stroke level, then pre-train a text
recognizer to provide stroke-level attention maps as positional clues that
control the consistency between the generated super-resolution image and the
high-resolution ground truth. Extensive experimental results validate that the
proposed method generates more distinguishable images on TextZoom and on the
manually constructed Chinese character dataset Degraded-IC13. Furthermore,
since the proposed SFM is used only to provide stroke-level guidance during
training, it introduces no time overhead at test time. Code is available at
https://github.com/FudanVI/FudanOCR/tree/main/text-gestalt.
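The stroke-focused objective described in the abstract can be sketched as an attention-weighted reconstruction loss: pixel errors between the super-resolved output and the ground truth are weighted by stroke-level attention maps from a pre-trained recognizer. The following is a minimal illustrative sketch, not the authors' implementation; the function name, array shapes, and the simple L1 weighting are all assumptions.

```python
import numpy as np

def stroke_focused_loss(sr, hr, attn_maps):
    """Attention-weighted L1 distance between SR output and HR ground truth.

    sr, hr:     (H, W, C) float arrays in [0, 1].
    attn_maps:  (T, H, W) stroke-level attention maps, one per decoded
                stroke token, each normalized to sum to 1 over the grid.
    """
    # Per-pixel L1 error, averaged over channels -> (H, W).
    pixel_l1 = np.abs(sr - hr).mean(axis=-1)
    # Weight each pixel's error by how strongly each stroke attends to it,
    # giving one scalar per stroke token -> (T,).
    per_stroke = (attn_maps * pixel_l1[None]).sum(axis=(1, 2))
    # Average over stroke tokens.
    return per_stroke.mean()
```

Because the attention maps concentrate mass on stroke regions, errors there dominate the loss, which is the intuition behind the SFM acting as a training-only guidance signal.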
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Text Image Inpainting via Global Structure-Guided Diffusion Models [22.859984320894135]
Real-world text can be damaged by corrosion issues caused by environmental or human factors.
Current inpainting techniques often fail to adequately address this problem.
We develop a novel neural framework, Global Structure-guided Diffusion Model (GSDM), as a potential solution.
arXiv Detail & Related papers (2024-01-26T13:01:28Z)
- Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model [31.819060415422353]
Diff-Text is a training-free scene text generation framework for any language.
Our method outperforms the existing method in both the accuracy of text recognition and the naturalness of foreground-background blending.
arXiv Detail & Related papers (2023-12-19T15:18:40Z)
- Orientation-Independent Chinese Text Recognition in Scene Images [61.34060587461462]
We make the first attempt to extract orientation-independent visual features by disentangling the content and orientation information of text images.
Specifically, we introduce a Character Image Reconstruction Network (CIRN) to recover corresponding printed character images with disentangled content and orientation information.
arXiv Detail & Related papers (2023-09-03T05:30:21Z)
- TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution [18.73348268987249]
TextDiff is a diffusion-based framework tailored for scene text image super-resolution.
It achieves state-of-the-art (SOTA) performance on public benchmark datasets.
The proposed MRD module is plug-and-play and effectively sharpens the text edges produced by SOTA methods.
arXiv Detail & Related papers (2023-08-13T11:02:16Z)
- Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z)
- Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks [48.81850740907517]
We present TATSR, a Text-Aware Text Super-Resolution framework.
It effectively learns the unique text characteristics using Criss-Cross Transformer Blocks (CCTBs) and a novel Content Perceptual (CP) Loss.
It outperforms state-of-the-art methods in terms of both recognition accuracy and human perception.
arXiv Detail & Related papers (2022-10-13T11:48:45Z)
- Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition [63.6608759501803]
We propose to recognize artistic text at three levels.
Firstly, corner points are applied to guide the extraction of local features inside characters, given the robustness of corner structures to variations in appearance and shape.
Secondly, we design a character contrastive loss to model the character-level feature, improving the feature representation for character classification.
Thirdly, we utilize Transformer to learn the global feature on image-level and model the global relationship of the corner points.
arXiv Detail & Related papers (2022-07-31T14:11:05Z)
- A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution [13.934846626570286]
Scene text image super-resolution aims to increase the resolution and readability of the text in low-resolution images.
It remains difficult to reconstruct high-resolution images for spatially deformed texts, especially rotated and curve-shaped ones.
We propose a CNN-based Text ATTention network (TATT) to address this problem.
arXiv Detail & Related papers (2022-03-17T15:28:29Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.