Improving Scene Text Image Super-resolution via Dual Prior Modulation
Network
- URL: http://arxiv.org/abs/2302.10414v2
- Date: Thu, 30 Nov 2023 02:27:49 GMT
- Title: Improving Scene Text Image Super-resolution via Dual Prior Modulation
Network
- Authors: Shipeng Zhu, Zuoyan Zhao, Pengfei Fang, Hui Xue
- Abstract summary: Scene text image super-resolution (STISR) aims to simultaneously increase the resolution and legibility of the text images.
Existing approaches neglect the global structure of the text, which bounds the semantic determinism of the scene text.
Our work proposes a plug-and-play module dubbed Dual Prior Modulation Network (DPMN), which leverages dual image-level priors to bring performance gain over existing approaches.
- Score: 20.687100711699788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text image super-resolution (STISR) aims to simultaneously increase the
resolution and legibility of the text images, and the resulting images will
significantly affect the performance of downstream tasks. Although numerous
progress has been made, existing approaches raise two crucial issues: (1) They
neglect the global structure of the text, which bounds the semantic determinism
of the scene text. (2) The priors, e.g., text prior or stroke prior, employed
in existing works, are extracted from pre-trained text recognizers. That said,
such priors suffer from the domain gap including low resolution and blurriness
caused by poor imaging conditions, leading to incorrect guidance. Our work
addresses these gaps and proposes a plug-and-play module dubbed Dual Prior
Modulation Network (DPMN), which leverages dual image-level priors to bring
performance gain over existing approaches. Specifically, two types of
prior-guided refinement modules, each using the text mask or graphic
recognition result of the low-quality SR image from the preceding layer, are
designed to improve the structural clarity and semantic accuracy of the text,
respectively. The following attention mechanism hence modulates two
quality-enhanced images to attain a superior SR result. Extensive experiments
validate that our method improves the image quality and boosts the performance
of downstream tasks over five typical approaches on the benchmark. Substantial
visualizations and ablation studies demonstrate the advantages of the proposed
DPMN. Code is available at: https://github.com/jdfxzzy/DPMN.
Related papers
- Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion
Models [58.46926334842161]
This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores.
Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
arXiv Detail & Related papers (2023-12-10T22:07:42Z) - PEAN: A Diffusion-Based Prior-Enhanced Attention Network for Scene Text Image Super-Resolution [18.936806519546508]
Scene text image super-resolution (STISR) aims at simultaneously increasing the resolution and readability of low-resolution scene text images.
Two factors in scene text images, visual structure and semantic information, affect the recognition performance significantly.
This paper proposes a Prior-Enhanced Attention Network (PEAN) to mitigate the effects from these factors.
arXiv Detail & Related papers (2023-11-29T08:11:20Z) - Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution [15.391125077873745]
Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images.
Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance.
We introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios.
arXiv Detail & Related papers (2023-11-22T11:10:45Z) - TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image
Super-Resolution [18.73348268987249]
TextDiff is a diffusion-based framework tailored for scene text image super-resolution.
It achieves state-of-the-art (SOTA) performance on public benchmark datasets.
Our proposed MRD module is plug-and-play that effectively sharpens the text edges produced by SOTA methods.
arXiv Detail & Related papers (2023-08-13T11:02:16Z) - Towards Robust Scene Text Image Super-resolution via Explicit Location
Enhancement [59.66539728681453]
Scene text image super-resolution (STISR) aims to improve image quality while boosting downstream scene text recognition accuracy.
Most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process.
We propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution.
arXiv Detail & Related papers (2023-07-19T05:08:47Z) - RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
arXiv Detail & Related papers (2023-05-31T06:59:21Z) - Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z) - DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient to synthesize realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z) - PerceptionGAN: Real-world Image Construction from Provided Text through
Perceptual Understanding [11.985768957782641]
We propose a method to provide good images by incorporating perceptual understanding in the discriminator module.
We show that the perceptual information included in the initial image is improved while modeling image distribution at multiple stages.
More importantly, the proposed method can be integrated into the pipeline of other state-of-the-art text-based-image-generation models.
arXiv Detail & Related papers (2020-07-02T09:23:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.