A Benchmark for Chinese-English Scene Text Image Super-resolution
- URL: http://arxiv.org/abs/2308.03262v1
- Date: Mon, 7 Aug 2023 02:57:48 GMT
- Title: A Benchmark for Chinese-English Scene Text Image Super-resolution
- Authors: Jianqi Ma, Zhetong Liang, Wangmeng Xiang, Xi Yang, Lei Zhang
- Abstract summary: Scene Text Image Super-resolution (STISR) aims to recover high-resolution (HR) scene text images with visually pleasant and readable text content from low-resolution (LR) input.
Most existing works focus on recovering English texts, which have relatively simple character structures.
We propose a real-world Chinese-English benchmark dataset, namely Real-CE, for the task of STISR.
- Score: 15.042152725255171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene Text Image Super-resolution (STISR) aims to recover high-resolution
(HR) scene text images with visually pleasant and readable text content from
the given low-resolution (LR) input. Most existing works focus on recovering
English texts, which have relatively simple character structures, while little
work has been done on the more challenging Chinese texts with diverse and
complex character structures. In this paper, we propose a real-world
Chinese-English benchmark dataset, namely Real-CE, for the task of STISR with
the emphasis on restoring structurally complex Chinese characters. The
benchmark provides 1,935/783 real-world LR-HR text image pairs (containing 33,789
text lines in total) for training/testing in 2× and 4× zooming
modes, complemented by detailed annotations, including detection boxes and text
transcripts. Moreover, we design an edge-aware learning method, which provides
structural supervision in image and feature domains, to effectively reconstruct
the dense structures of Chinese characters. We conduct experiments on the
proposed Real-CE benchmark and evaluate the existing STISR models with and
without our edge-aware loss. The benchmark, including data and source code, is
available at https://github.com/mjq11302010044/Real-CE.
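The abstract describes the edge-aware method only as "structural supervision in image and feature domains". As a rough illustration of the image-domain part (not the paper's actual formulation), an edge-aware loss can be sketched by comparing Sobel edge maps of the super-resolved and ground-truth images; the weighting factor `lam` is a hypothetical hyperparameter, not a value from the paper:

```python
import numpy as np

def sobel_edges(img: np.ndarray) -> np.ndarray:
    """Edge-magnitude map of a 2-D grayscale image via 3x3 Sobel filters."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    pad = np.pad(img.astype(np.float64), 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)  # horizontal gradient response
            gy[i, j] = np.sum(patch * ky)  # vertical gradient response
    return np.hypot(gx, gy)

def edge_aware_loss(sr: np.ndarray, hr: np.ndarray, lam: float = 0.1) -> float:
    """L1 pixel loss plus an L1 penalty on the difference of edge maps."""
    pixel_loss = np.mean(np.abs(sr - hr))
    edge_loss = np.mean(np.abs(sobel_edges(sr) - sobel_edges(hr)))
    return float(pixel_loss + lam * edge_loss)
```

The extra edge term penalizes stroke-boundary mismatches more heavily than a plain pixel loss, which is the intuition behind supervising the dense stroke structures of Chinese characters.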
Related papers
- VCR: Visual Caption Restoration [80.24176572093512]
We introduce Visual Caption Restoration (VCR), a vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images.
This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images.
arXiv Detail & Related papers (2024-06-10T16:58:48Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- Scene Text Image Super-resolution based on Text-conditional Diffusion Models [0.0]
Scene Text Image Super-resolution (STISR) has recently achieved great success as a preprocessing method for scene text recognition.
In this study, we leverage text-conditional diffusion models (DMs) for STISR tasks.
We propose a novel framework for LR-HR paired text image datasets.
arXiv Detail & Related papers (2023-11-16T10:32:18Z)
- Orientation-Independent Chinese Text Recognition in Scene Images [61.34060587461462]
We make a first attempt to extract orientation-independent visual features by disentangling the content and orientation information of text images.
Specifically, we introduce a Character Image Reconstruction Network (CIRN) to recover corresponding printed character images with disentangled content and orientation information.
arXiv Detail & Related papers (2023-09-03T05:30:21Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks [48.81850740907517]
We present TATSR, a Text-Aware Text Super-Resolution framework.
It effectively learns the unique text characteristics using Criss-Cross Transformer Blocks (CCTBs) and a novel Content Perceptual (CP) Loss.
It outperforms state-of-the-art methods in terms of both recognition accuracy and human perception.
arXiv Detail & Related papers (2022-10-13T11:48:45Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Text Prior Guided Scene Text Image Super-resolution [11.396781380648756]
Scene text image super-resolution (STISR) aims to improve the resolution and visual quality of low-resolution (LR) scene text images.
We embed a categorical text prior into STISR model training.
We present a multi-stage text prior guided super-resolution framework for STISR.
arXiv Detail & Related papers (2021-06-29T12:52:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.