A Text Attention Network for Spatial Deformation Robust Scene Text Image
Super-resolution
- URL: http://arxiv.org/abs/2203.09388v2
- Date: Fri, 18 Mar 2022 03:07:32 GMT
- Title: A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution
- Authors: Jianqi Ma, Zhetong Liang, Lei Zhang
- Abstract summary: Scene text image super-resolution aims to increase the resolution and readability of the text in low-resolution images.
It remains difficult to reconstruct high-resolution images for spatially deformed texts, especially rotated and curve-shaped ones.
We propose a CNN based Text ATTention network (TATT) to address this problem.
- Score: 13.934846626570286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text image super-resolution aims to increase the resolution and
readability of the text in low-resolution images. Though significant
improvement has been achieved by deep convolutional neural networks (CNNs), it
remains difficult to reconstruct high-resolution images for spatially deformed
texts, especially rotated and curve-shaped ones. This is because the current
CNN-based methods adopt locality-based operations, which are not effective to
deal with the variation caused by deformations. In this paper, we propose a
CNN-based Text ATTention network (TATT) to address this problem. The semantics of
the text are firstly extracted by a text recognition module as text prior
information. Then we design a novel transformer-based module, which leverages a
global attention mechanism, to exert the semantic guidance of the text prior on the
text reconstruction process. In addition, we propose a text structure
consistency loss to refine the visual appearance by imposing structural
consistency on the reconstructions of regular and deformed texts. Experiments
on the benchmark TextZoom dataset show that the proposed TATT not only achieves
state-of-the-art performance in terms of PSNR/SSIM metrics, but also
significantly improves the recognition accuracy in the downstream text
recognition task, particularly for text instances with multi-orientation and
curved shapes. Code is available at https://github.com/mjq11302010044/TATT.
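The abstract describes a transformer-based module that injects the recognized text prior into the reconstruction process through global attention. As an illustrative sketch only (not the authors' implementation; the projection matrices, shapes, and the `text_prior_cross_attention` name are hypothetical), a single cross-attention head in which flattened spatial image features query the per-character text prior could look like:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_prior_cross_attention(img_feats, text_prior, W_q, W_k, W_v):
    """Cross-attention: image features query the text prior.

    img_feats:  (N, d) flattened H*W spatial features of the LR image
    text_prior: (L, d) per-character embeddings from the recognition module
    """
    Q = img_feats @ W_q                              # (N, d) queries from pixels
    K = text_prior @ W_k                             # (L, d) keys from characters
    V = text_prior @ W_v                             # (L, d) values from characters
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N, L): each pixel attends over all characters
    return attn @ V                                  # (N, d) semantic guidance per spatial location

rng = np.random.default_rng(0)
d, N, L = 32, 16 * 64, 26        # feature dim, spatial positions, text-prior length (all illustrative)
img = rng.standard_normal((N, d))
prior = rng.standard_normal((L, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
guided = text_prior_cross_attention(img, prior, Wq, Wk, Wv)
print(guided.shape)              # (1024, 32)
```

Because the attention is global, every spatial position can draw on the whole text prior, which is why such a module is less sensitive to rotation and curvature than purely local convolutions.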
Related papers
- WAS: Dataset and Methods for Artistic Text Segmentation [57.61335995536524]
This paper focuses on the more challenging task of artistic text segmentation and constructs a real artistic text segmentation dataset.
We propose a decoder with the layer-wise momentum query to prevent the model from ignoring stroke regions of special shapes.
We also propose a skeleton-assisted head to guide the model to focus on the global structure.
arXiv Detail & Related papers (2024-07-31T18:29:36Z)
- PEAN: A Diffusion-Based Prior-Enhanced Attention Network for Scene Text Image Super-Resolution [18.936806519546508]
Scene text image super-resolution (STISR) aims at simultaneously increasing the resolution and readability of low-resolution scene text images.
Two factors in scene text images, visual structure and semantic information, affect the recognition performance significantly.
This paper proposes a Prior-Enhanced Attention Network (PEAN) to mitigate the effects from these factors.
arXiv Detail & Related papers (2023-11-29T08:11:20Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution [18.73348268987249]
TextDiff is a diffusion-based framework tailored for scene text image super-resolution.
It achieves state-of-the-art (SOTA) performance on public benchmark datasets.
Our proposed MRD module is plug-and-play and effectively sharpens the text edges produced by SOTA methods.
arXiv Detail & Related papers (2023-08-13T11:02:16Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks [48.81850740907517]
We present TATSR, a Text-Aware Text Super-Resolution framework.
It effectively learns the unique text characteristics using Criss-Cross Transformer Blocks (CCTBs) and a novel Content Perceptual (CP) Loss.
It outperforms state-of-the-art methods in terms of both recognition accuracy and human perception.
arXiv Detail & Related papers (2022-10-13T11:48:45Z)
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder in decoding the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG)
arXiv Detail & Related papers (2021-11-16T09:10:39Z)
- Text Prior Guided Scene Text Image Super-resolution [11.396781380648756]
Scene text image super-resolution (STISR) aims to improve the resolution and visual quality of low-resolution (LR) scene text images.
We make an attempt to embed categorical text prior into STISR model training.
We present a multi-stage text prior guided super-resolution framework for STISR.
arXiv Detail & Related papers (2021-06-29T12:52:33Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
- Scene Text Image Super-Resolution in the Wild [112.90416737357141]
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones.
Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images.
We propose a real scene text SR dataset, termed TextZoom.
It contains paired real low-resolution and high-resolution images captured by cameras with different focal lengths in the wild.
arXiv Detail & Related papers (2020-05-07T09:18:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.