Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders
- URL: http://arxiv.org/abs/2506.04641v1
- Date: Thu, 05 Jun 2025 05:23:10 GMT
- Title: Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders
- Authors: Qiming Hu, Linlong Fan, Yiyan Luo, Yuhang Yu, Xiaojie Guo, Qingnan Fan
- Abstract summary: We introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders. We propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks. Our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics.
- Score: 14.655107789528673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The introduction of generative models has significantly advanced image super-resolution (SR) in handling real-world degradations. However, they often incur fidelity-related issues, particularly distorting textual structures. In this paper, we introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders to recover not only natural details but also the structural fidelity of text regions in degraded real-world images. Moreover, we propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks, combining realistic foreground text regions with detailed background content. Extensive experiments demonstrate that our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics and exhibiting strong generalization to real-world scenarios. Our code is available at https://github.com/mingcv/TADiSR.
Related papers
- TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance [24.242452422416438]
We introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Text Image Super-Resolution. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR.
arXiv Detail & Related papers (2025-05-29T05:40:35Z)
- HoliSDiP: Image Super-Resolution via Holistic Semantics and Diffusion Prior [62.04939047885834]
We present HoliSDiP, a framework that leverages semantic segmentation to provide both precise textual and spatial guidance for Real-ISR. Our method employs semantic labels as concise text prompts while introducing dense semantic guidance through segmentation masks and our proposed spatial-CLIP Map.
arXiv Detail & Related papers (2024-11-27T15:22:44Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- CoSeR: Bridging Image and Language for Cognitive Super-Resolution [74.24752388179992]
We introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images.
We achieve this by marrying image appearance and language understanding to generate a cognitive embedding.
To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention".
arXiv Detail & Related papers (2023-11-27T16:33:29Z)
- Image Super-Resolution with Text Prompt Diffusion [118.023531454099]
We introduce text prompts to image SR to provide degradation priors. PromptSR leverages the latest multi-modal large language model (MLLM) to generate prompts from low-resolution images. Experiments indicate that introducing text prompts into SR yields impressive results on both synthetic and real-world images.
arXiv Detail & Related papers (2023-11-24T05:11:35Z)
- Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution [15.391125077873745]
Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images.
Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance.
We introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios.
arXiv Detail & Related papers (2023-11-22T11:10:45Z)
- Scene Text Image Super-resolution based on Text-conditional Diffusion Models [0.0]
Scene Text Image Super-resolution (STISR) has recently achieved great success as a preprocessing method for scene text recognition.
In this study, we leverage text-conditional diffusion models (DMs) for STISR tasks.
We propose a novel framework for LR-HR paired text image datasets.
arXiv Detail & Related papers (2023-11-16T10:32:18Z)
- Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks [48.81850740907517]
We present TATSR, a Text-Aware Text Super-Resolution framework.
It effectively learns the unique text characteristics using Criss-Cross Transformer Blocks (CCTBs) and a novel Content Perceptual (CP) Loss.
It outperforms state-of-the-art methods in terms of both recognition accuracy and human perception.
arXiv Detail & Related papers (2022-10-13T11:48:45Z)
- Rethinking Super-Resolution as Text-Guided Details Generation [21.695227836312835]
We propose a Text-Guided Super-Resolution (TGSR) framework, which can effectively utilize the information from the text and image modalities.
The proposed TGSR could generate HR image details that match the text descriptions through a coarse-to-fine process.
arXiv Detail & Related papers (2022-07-14T01:46:38Z)
- A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution [13.934846626570286]
Scene text image super-resolution aims to increase the resolution and readability of the text in low-resolution images.
It remains difficult to reconstruct high-resolution images for spatially deformed texts, especially rotated and curve-shaped ones.
We propose a CNN based Text ATTention network (TATT) to address this problem.
arXiv Detail & Related papers (2022-03-17T15:28:29Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient to synthesize realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.