TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution
- URL: http://arxiv.org/abs/2308.06743v1
- Date: Sun, 13 Aug 2023 11:02:16 GMT
- Title: TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution
- Authors: Baolin Liu and Zongyuan Yang and Pengfei Wang and Junjie Zhou and Ziqi Liu and Ziyi Song and Yan Liu and Yongping Xiong
- Abstract summary: TextDiff is a diffusion-based framework tailored for scene text image super-resolution.
It achieves state-of-the-art (SOTA) performance on public benchmark datasets.
Our proposed MRD module is plug-and-play and effectively sharpens the text edges produced by SOTA methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of scene text image super-resolution is to reconstruct
high-resolution text-line images from unrecognizable low-resolution inputs.
Existing methods that optimize a pixel-level loss tend to yield noticeably
blurred text edges, which substantially harms both the readability and the
recognizability of the text. To
address these issues, we propose TextDiff, the first diffusion-based framework
tailored for scene text image super-resolution. It contains two modules: the
Text Enhancement Module (TEM) and the Mask-Guided Residual Diffusion Module
(MRD). The TEM generates an initial deblurred text image and a mask that
encodes the spatial location of the text. The MRD is responsible for
effectively sharpening the text edge by modeling the residuals between the
ground-truth images and the initial deblurred images. Extensive experiments
demonstrate that our TextDiff achieves state-of-the-art (SOTA) performance on
public benchmark datasets and can improve the readability of scene text images.
Moreover, our proposed MRD module is plug-and-play and effectively sharpens
the text edges produced by SOTA methods. This enhancement not only improves the
readability and recognizability of the results generated by SOTA methods but
also does not require any additional joint training. Code is available at:
https://github.com/Lenubolim/TextDiff.
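
For readers who want a concrete picture of the residual-diffusion idea described above, the following PyTorch sketch trains a DDPM-style denoiser on the residual between the ground-truth image and the initial deblurred output, conditioned on that output and the text mask. All names, the linear noise schedule, and the denoiser(x_t, cond, t) interface are illustrative assumptions, not the authors' released code; consult the repository linked above for the actual implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualDiffusionSketch(nn.Module):
        """DDPM-style training on the residual r = HR - initial_SR, conditioned
        on the initial deblurred image and the text mask (both assumed to come
        from a TEM-like front end)."""

        def __init__(self, denoiser: nn.Module, timesteps: int = 1000):
            super().__init__()
            self.denoiser = denoiser   # assumed signature: denoiser(x_t, cond, t) -> eps
            self.timesteps = timesteps
            betas = torch.linspace(1e-4, 0.02, timesteps)  # standard linear schedule
            alphas = 1.0 - betas
            alphas_cumprod = torch.cumprod(alphas, dim=0)
            self.register_buffer("betas", betas)
            self.register_buffer("alphas", alphas)
            self.register_buffer("sqrt_ac", alphas_cumprod.sqrt())
            self.register_buffer("sqrt_om_ac", (1.0 - alphas_cumprod).sqrt())

        def training_loss(self, hr, initial_sr, mask):
            residual = hr - initial_sr  # the diffusion target is the residual, not the image
            t = torch.randint(0, self.timesteps, (hr.size(0),), device=hr.device)
            noise = torch.randn_like(residual)
            # Forward (noising) process q(x_t | x_0) applied to the residual.
            x_t = (self.sqrt_ac[t, None, None, None] * residual
                   + self.sqrt_om_ac[t, None, None, None] * noise)
            cond = torch.cat([initial_sr, mask], dim=1)  # mask steers correction to text regions
            return F.mse_loss(self.denoiser(x_t, cond, t), noise)

        @torch.no_grad()
        def sample(self, initial_sr, mask):
            # Reverse process: denoise pure noise into a residual, then add it back.
            cond = torch.cat([initial_sr, mask], dim=1)
            x = torch.randn_like(initial_sr)
            for i in reversed(range(self.timesteps)):
                t = torch.full((x.size(0),), i, device=x.device, dtype=torch.long)
                eps = self.denoiser(x, cond, t)
                x = (x - self.betas[i] / self.sqrt_om_ac[i] * eps) / self.alphas[i].sqrt()
                if i > 0:
                    x = x + self.betas[i].sqrt() * torch.randn_like(x)
            return initial_sr + x  # sharpened output = initial image + sampled residual

Because the conditioning is just an (image, mask) pair, a module trained this way can be applied at inference time to the initial output of any super-resolution backbone, which is consistent with the plug-and-play behavior the abstract describes.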
Related papers
- Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition (arXiv, 2024-02-21)
We propose a novel approach called Class-Aware Mask-guided feature refinement (CAM).
Our approach introduces canonical class-aware glyph masks to suppress background and text-style noise.
By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion.
- PEAN: A Diffusion-Based Prior-Enhanced Attention Network for Scene Text Image Super-Resolution (arXiv, 2023-11-29)
Scene text image super-resolution (STISR) aims at simultaneously increasing the resolution and readability of low-resolution scene text images.
Two factors in scene text images, visual structure and semantic information, affect recognition performance significantly.
This paper proposes a Prior-Enhanced Attention Network (PEAN) to mitigate the effects of these factors.
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models (arXiv, 2023-11-28)
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
Even with fewer text instances, the text images we produce consistently surpass other synthetic data in aiding text detectors.
- Scene Text Image Super-resolution based on Text-conditional Diffusion Models (arXiv, 2023-11-16)
Scene text image super-resolution (STISR) has recently achieved great success as a preprocessing method for scene text recognition.
In this study, we leverage text-conditional diffusion models (DMs) for STISR tasks.
We propose a novel framework for synthesizing LR-HR paired text image datasets.
- Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement (arXiv, 2023-07-19)
Scene text image super-resolution (STISR) aims to improve image quality while boosting downstream scene text recognition accuracy.
Most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process.
We propose a novel method, LEMMA, that explicitly models character regions to produce high-level text-specific guidance for super-resolution.
- TextDiffuser: Diffusion Models as Text Painters (arXiv, 2023-05-18)
We introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with the background.
We contribute MARIO-10M, the first large-scale text-image dataset with OCR annotations, containing 10 million image-text pairs.
We show that TextDiffuser is flexible and controllable: it creates high-quality text images from text prompts alone or together with text template images, and it can perform text inpainting to reconstruct incomplete images.
- Improving Scene Text Image Super-resolution via Dual Prior Modulation Network (arXiv, 2023-02-21)
Scene text image super-resolution (STISR) aims to simultaneously increase the resolution and legibility of text images.
Existing approaches neglect the global structure of the text, which bounds the semantic determinism of the scene text.
Our work proposes a plug-and-play module dubbed Dual Prior Modulation Network (DPMN), which leverages dual image-level priors to bring performance gains over existing approaches.
- Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks (arXiv, 2022-10-13)
We present TATSR, a Text-Aware Text Super-Resolution framework.
It effectively learns the unique characteristics of text using Criss-Cross Transformer Blocks (CCTBs) and a novel Content Perceptual (CP) Loss.
It outperforms state-of-the-art methods in terms of both recognition accuracy and human perception.
- A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution (arXiv, 2022-03-17)
Scene text image super-resolution aims to increase the resolution and readability of the text in low-resolution images.
It remains difficult to reconstruct high-resolution images for spatially deformed texts, especially rotated and curve-shaped ones.
We propose a CNN-based Text ATTention network (TATT) to address this problem.
- Scene Text Image Super-Resolution in the Wild (arXiv, 2020-05-07)
Low-resolution text images are often seen in natural scenes, such as documents captured by mobile phones.
Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images.
We propose a real scene text SR dataset, termed TextZoom.
It contains paired real low-resolution and high-resolution images captured by cameras with different focal lengths in the wild.
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.