Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement
- URL: http://arxiv.org/abs/2307.09749v2
- Date: Sun, 30 Jul 2023 03:53:29 GMT
- Title: Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement
- Authors: Hang Guo, Tao Dai, Guanghao Meng, Shu-Tao Xia
- Abstract summary: Scene text image super-resolution (STISR) aims to improve image quality while boosting downstream scene text recognition accuracy.
Most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process.
We propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution.
- Score: 59.66539728681453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text image super-resolution (STISR), which aims to improve
image quality while boosting downstream scene text recognition accuracy, has
recently achieved great success. However, most existing methods treat the
foreground (character regions) and background (non-character regions) equally
in the forward process and neglect interference from the complex background,
which limits performance. To address these issues, we propose LEMMA, a novel
method that explicitly models character regions to produce high-level,
text-specific guidance for super-resolution. To model the location of
characters effectively, we propose a location enhancement module that extracts
character-region features based on the attention map sequence. In addition, we
propose a multi-modal alignment module that performs bidirectional
visual-semantic alignment to generate high-quality prior guidance, which is
then incorporated into the super-resolution branch adaptively by the proposed
adaptive fusion module. Experiments on TextZoom and four scene text
recognition benchmarks demonstrate the superiority of our method over other
state-of-the-art methods. Code is available at
https://github.com/csguoh/LEMMA.
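The abstract names three modules without implementation detail. As a rough illustration only, here is a minimal PyTorch sketch of the location-enhancement and adaptive-fusion ideas, assuming hypothetical shapes and module names throughout; the multi-modal alignment module is omitted, and the authors' actual implementation lives in the linked repository.

```python
# Minimal sketch of LEMMA's high-level idea (not the authors' code):
# character-location attention maps gate image features into text-specific
# guidance, which is fused adaptively into the super-resolution branch.
# All module and parameter names below are hypothetical.
import torch
import torch.nn as nn

class LocationEnhancement(nn.Module):
    """Pool image features under per-character attention maps."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, feats, attn_maps):
        # feats: (B, C, H, W); attn_maps: (B, T, H, W), one map per character slot
        attn = attn_maps.flatten(2).softmax(dim=-1)                  # (B, T, H*W)
        pooled = torch.bmm(attn, feats.flatten(2).transpose(1, 2))   # (B, T, C)
        return self.proj(pooled)                                     # character-region features

class AdaptiveFusion(nn.Module):
    """Gate high-level guidance into SR features instead of adding it uniformly."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * feat_dim, feat_dim, 1), nn.Sigmoid())

    def forward(self, sr_feats, guidance):
        # sr_feats: (B, C, H, W); guidance: (B, T, C) -> broadcast over the grid
        g = guidance.mean(dim=1)[:, :, None, None].expand_as(sr_feats)
        alpha = self.gate(torch.cat([sr_feats, g], dim=1))           # per-pixel fusion weight
        return sr_feats + alpha * g

if __name__ == "__main__":
    feats = torch.randn(2, 64, 16, 64)       # low-res text-image features
    attn_maps = torch.randn(2, 26, 16, 64)   # e.g. 26 character slots
    guidance = LocationEnhancement(64)(feats, attn_maps)
    fused = AdaptiveFusion(64)(feats, guidance)
    print(fused.shape)                        # torch.Size([2, 64, 16, 64])
```

The gating reflects the abstract's "adaptive manner" of incorporation: fusion weights are predicted per pixel rather than fixed, so background regions can attenuate the text prior.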
Related papers
- Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.95875467352853]
We propose a new referring remote sensing image segmentation method, FIANet, that fully exploits the visual and linguistic representations.
The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding texts.
We evaluate the effectiveness of the proposed methods on two public referring remote sensing datasets, RefSegRS and RRSIS-D.
arXiv Detail & Related papers (2024-09-20T16:45:32Z)
- DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution [54.05367433562495]
Region-level multi-modality methods can translate referred image regions into human-preferred language descriptions.
Unfortunately, most existing methods use fixed visual inputs and lack the resolution adaptability needed to produce precise language descriptions.
We propose a dynamic resolution approach, referred to as DynRefer, to pursue high-accuracy region-level referring.
arXiv Detail & Related papers (2024-05-25T05:44:55Z)
- TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution [18.73348268987249]
TextDiff is a diffusion-based framework tailored for scene text image super-resolution.
It achieves state-of-the-art (SOTA) performance on public benchmark datasets.
The proposed MRD module is plug-and-play and effectively sharpens the text edges produced by SOTA methods (a rough sketch of the residual idea follows this entry).
arXiv Detail & Related papers (2023-08-13T11:02:16Z)
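For the TextDiff entry above: the real MRD module is diffusion-based, so the following is only a loose, hypothetical sketch of the plug-and-play residual-sharpening idea, in which a small network predicts a correction that is applied mainly inside a predicted text mask.

```python
# Loose illustration of a plug-and-play residual refiner in the spirit of
# TextDiff's MRD module (the real module is diffusion-based; this is not it).
# Names and shapes are hypothetical.
import torch
import torch.nn as nn

class ResidualSharpener(nn.Module):
    def __init__(self, ch: int = 32):
        super().__init__()
        # input channels: base SR image (3) + predicted soft text mask (1)
        self.body = nn.Sequential(
            nn.Conv2d(4, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, sr_image, text_mask):
        residual = self.body(torch.cat([sr_image, text_mask], dim=1))
        # apply the correction mainly inside text regions
        return sr_image + text_mask * residual

if __name__ == "__main__":
    sr = torch.rand(1, 3, 32, 128)     # output of any base SR model
    mask = torch.rand(1, 1, 32, 128)   # soft text-region mask
    print(ResidualSharpener()(sr, mask).shape)  # torch.Size([1, 3, 32, 128])
```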
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows mutual training and optimization of the classification, segmentation, and recognition branches, resulting in deeper feature sharing (a structural sketch follows this entry).
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
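For the TextFormer entry above, a hypothetical skeleton of a query-based multi-task spotter: one image encoder, learned queries, and a shared Transformer decoder whose outputs feed classification, segmentation, and recognition heads. All names, dimensions, and the per-query recognition head are stand-ins, not the paper's design.

```python
# Hypothetical skeleton of a query-based multi-task text spotter in the spirit
# of TextFormer: shared query features feed three task heads. Not the authors' code.
import torch
import torch.nn as nn

class QuerySpotter(nn.Module):
    def __init__(self, d: int = 256, num_queries: int = 100, num_chars: int = 97):
        super().__init__()
        self.backbone = nn.Conv2d(3, d, 16, stride=16)      # stand-in image encoder
        self.queries = nn.Parameter(torch.randn(num_queries, d))
        layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.cls_head = nn.Linear(d, 2)          # text / no-text classification
        self.seg_head = nn.Linear(d, 32 * 32)    # per-query mask logits
        self.rec_head = nn.Linear(d, num_chars)  # per-query char logits
                                                 # (a real recognizer decodes a sequence)

    def forward(self, images):
        mem = self.backbone(images).flatten(2).transpose(1, 2)   # (B, HW, d)
        q = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        feats = self.decoder(q, mem)              # shared query features for all branches
        return self.cls_head(feats), self.seg_head(feats), self.rec_head(feats)

if __name__ == "__main__":
    cls, seg, rec = QuerySpotter()(torch.randn(2, 3, 256, 256))
    print(cls.shape, seg.shape, rec.shape)
```

The "deeper feature sharing" claim corresponds to all three heads reading the same decoder output, so gradients from every task shape one representation.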
- Improving Scene Text Image Super-resolution via Dual Prior Modulation Network [20.687100711699788]
Scene text image super-resolution (STISR) aims to simultaneously increase the resolution and legibility of the text images.
Existing approaches neglect the global structure of the text, which limits the semantic determinism of the scene text.
Our work proposes a plug-and-play module dubbed Dual Prior Modulation Network (DPMN), which leverages dual image-level priors to bring performance gains over existing approaches.
arXiv Detail & Related papers (2023-02-21T02:59:37Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map whose regions carry local descriptions (a sketch of this input representation follows this entry).
We show its effectiveness on two state-of-the-art diffusion models: one pixel-based and one latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
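For the SpaText entry above, a small sketch of what a spatio-textual input could look like: each user-drawn segment's local description is embedded and painted into a spatial tensor that can be concatenated with the diffusion model's input. The embedding dimension and the painting rule are assumptions, not the paper's exact representation.

```python
# Hypothetical sketch of a spatio-textual input map: per-segment text
# embeddings painted into a spatial tensor (stand-in shapes and rules).
import torch

def spatio_textual_map(seg_masks, seg_text_embs, H, W):
    # seg_masks: (N, H, W) binary masks, one per user-drawn segment
    # seg_text_embs: (N, D) embeddings of each segment's local description
    N, D = seg_text_embs.shape
    canvas = torch.zeros(D, H, W)
    for mask, emb in zip(seg_masks, seg_text_embs):
        canvas += emb[:, None, None] * mask[None]   # paint embedding into its region
    return canvas                                    # (D, H, W): extra input channels

if __name__ == "__main__":
    masks = (torch.rand(2, 64, 64) > 0.5).float()
    embs = torch.randn(2, 16)
    print(spatio_textual_map(masks, embs, 64, 64).shape)  # torch.Size([16, 64, 64])
```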
- DT2I: Dense Text-to-Image Generation from Region Descriptions [3.883984493622102]
We introduce dense text-to-image (DT2I) synthesis as a new task to pave the way toward more intuitive image generation.
We also propose DTC-GAN, a novel method to generate images from semantically rich region descriptions.
arXiv Detail & Related papers (2022-04-05T07:57:11Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
Its StyleGAN inversion module maps real images into the latent space of a well-trained StyleGAN.
Its visual-linguistic similarity module learns text-image matching by mapping the image and the text into a common embedding space (a minimal sketch of this idea follows the list).
Its instance-level optimization preserves identity during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
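As referenced in the TediGAN entry, a minimal sketch of the common-embedding idea behind visual-linguistic similarity: project image and text features into one space, normalize, and score matches by cosine similarity. The encoders, dimensions, and training loss are hypothetical stand-ins.

```python
# Minimal sketch of common-embedding text-image matching (stand-in encoders
# and dimensions; TediGAN's actual modules differ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonEmbedding(nn.Module):
    def __init__(self, img_dim: int = 512, txt_dim: int = 300, joint_dim: int = 256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, img_feats, txt_feats):
        zi = F.normalize(self.img_proj(img_feats), dim=-1)
        zt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return zi @ zt.t()        # (B_img, B_txt) cosine similarity matrix

if __name__ == "__main__":
    sim = CommonEmbedding()(torch.randn(4, 512), torch.randn(4, 300))
    # matched pairs sit on the diagonal; train with a ranking/contrastive loss
    print(sim.shape)              # torch.Size([4, 4])
```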