Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution
- URL: http://arxiv.org/abs/2311.13317v1
- Date: Wed, 22 Nov 2023 11:10:45 GMT
- Title: Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution
- Authors: Yuxuan Zhou, Liangcai Gao, Zhi Tang, Baole Wei
- Abstract summary: Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images.
Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance.
We introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios.
- Score: 15.391125077873745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and
legibility of text within low-resolution (LR) images, consequently elevating
recognition accuracy in Scene Text Recognition (STR). Previous methods
predominantly employ discriminative Convolutional Neural Networks (CNNs)
augmented with diverse forms of text guidance to address this issue.
Nevertheless, they remain deficient when confronted with severely blurred
images, due to their insufficient generation capability when little structural
or semantic information can be extracted from original images. Therefore, we
introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image
Super-Resolution, which exhibits great generative diversity and fidelity even
in challenging scenarios. Moreover, we propose a Recognition-Guided Denoising
Network, to guide the diffusion model generating LR-consistent results through
succinct semantic guidance. Experiments on the TextZoom dataset demonstrate the
superiority of RGDiffSR over prior state-of-the-art methods in both text
recognition accuracy and image fidelity.
Related papers
- PEAN: A Diffusion-Based Prior-Enhanced Attention Network for Scene Text Image Super-Resolution [18.936806519546508]
Scene text image super-resolution (STISR) aims at simultaneously increasing the resolution and readability of low-resolution scene text images.
Two factors in scene text images, visual structure and semantic information, affect the recognition performance significantly.
This paper proposes a Prior-Enhanced Attention Network (PEAN) to mitigate the effects from these factors.
arXiv Detail & Related papers (2023-11-29T08:11:20Z) - CoSeR: Bridging Image and Language for Cognitive Super-Resolution [74.24752388179992]
We introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images.
We achieve this by marrying image appearance and language understanding to generate a cognitive embedding.
To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention"
arXiv Detail & Related papers (2023-11-27T16:33:29Z) - Image Super-Resolution with Text Prompt Diffusion [118.023531454099]
We introduce text prompts to image SR to provide degradation priors.
PromptSR utilizes the pre-trained language model (e.g., T5 or CLIP) to enhance restoration.
Experiments indicate that introducing text prompts into SR, yields excellent results on both synthetic and real-world images.
arXiv Detail & Related papers (2023-11-24T05:11:35Z) - RSDiff: Remote Sensing Image Generation from Text Using Diffusion Model [0.8747606955991705]
This research introduces a two-stage diffusion model methodology for synthesizing high-resolution satellite images from textual prompts.
The pipeline comprises a Low-Resolution Diffusion Model (LRDM) that generates initial images based on text inputs and a Super-Resolution Diffusion Model (SRDM) that refines these images into high-resolution outputs.
arXiv Detail & Related papers (2023-09-03T09:34:49Z) - DiffDis: Empowering Generative Diffusion Model with Cross-Modal
Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z) - TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image
Super-Resolution [18.73348268987249]
TextDiff is a diffusion-based framework tailored for scene text image super-resolution.
It achieves state-of-the-art (SOTA) performance on public benchmark datasets.
Our proposed MRD module is plug-and-play that effectively sharpens the text edges produced by SOTA methods.
arXiv Detail & Related papers (2023-08-13T11:02:16Z) - Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z) - Scene Text Image Super-Resolution via Content Perceptual Loss and
Criss-Cross Transformer Blocks [48.81850740907517]
We present TATSR, a Text-Aware Text Super-Resolution framework.
It effectively learns the unique text characteristics using Criss-Cross Transformer Blocks (CCTBs) and a novel Content Perceptual (CP) Loss.
It outperforms state-of-the-art methods in terms of both recognition accuracy and human perception.
arXiv Detail & Related papers (2022-10-13T11:48:45Z) - Rethinking Super-Resolution as Text-Guided Details Generation [21.695227836312835]
We propose a Text-Guided Super-Resolution (TGSR) framework, which can effectively utilize the information from the text and image modalities.
The proposed TGSR could generate HR image details that match the text descriptions through a coarse-to-fine process.
arXiv Detail & Related papers (2022-07-14T01:46:38Z) - Scene Text Image Super-Resolution in the Wild [112.90416737357141]
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones.
Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images.
We pro-pose a real scene text SR dataset, termed TextZoom.
It contains paired real low-resolution and high-resolution images captured by cameras with different focal length in the wild.
arXiv Detail & Related papers (2020-05-07T09:18:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.