TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance
- URL: http://arxiv.org/abs/2505.23119v1
- Date: Thu, 29 May 2025 05:40:35 GMT
- Title: TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance
- Authors: Keren Ye, Ignacio Garcia Dorado, Michalis Raptis, Mauricio Delbracio, Irene Zhu, Peyman Milanfar, Hossein Talebi,
- Abstract summary: We introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Text Image Super-Resolution.<n>By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process.<n>The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR.
- Score: 24.242452422416438
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While recent advancements in Image Super-Resolution (SR) using diffusion models have shown promise in improving overall image quality, their application to scene text images has revealed limitations. These models often struggle with accurate text region localization and fail to effectively model image and multilingual character-to-shape priors. This leads to inconsistencies, the generation of hallucinated textures, and a decrease in the perceived quality of the super-resolved text. To address these issues, we introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Scene Text Image Super-Resolution. TextSR leverages a text detector to pinpoint text regions within an image and then employs Optical Character Recognition (OCR) to extract multilingual text from these areas. The extracted text characters are then transformed into visual shapes using a UTF-8 based text encoder and cross-attention. Recognizing that OCR may sometimes produce inaccurate results in real-world scenarios, we have developed two innovative methods to enhance the robustness of our model. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process, enhancing fine details within the text and improving overall legibility. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR, underscoring the efficacy of our approach.
Related papers
- Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders [14.655107789528673]
We introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders.<n>We propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks.<n>Our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics.
arXiv Detail & Related papers (2025-06-05T05:23:10Z) - TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images.<n>We introduce TextInVision, a large-scale, text and prompt complexity driven benchmark to evaluate the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z) - ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models [52.23899502520261]
We introduce a novel framework named, ARTIST, which incorporates a dedicated textual diffusion model to focus on the learning of text structures specifically.<n>We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.<n>This disentangled architecture design and training strategy significantly enhance the text rendering ability of the diffusion models for text-rich image generation.
arXiv Detail & Related papers (2024-06-17T19:31:24Z) - Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using
Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z) - Image Super-Resolution with Text Prompt Diffusion [118.023531454099]
We introduce text prompts to image SR to provide degradation priors.<n>PromptSR leverages the latest multi-modal large language model (MLLM) to generate prompts from low-resolution images.<n>Experiments indicate that introducing text prompts into SR, yields impressive results on both synthetic and real-world images.
arXiv Detail & Related papers (2023-11-24T05:11:35Z) - Scene Text Image Super-resolution based on Text-conditional Diffusion
Models [0.0]
Scene Text Image Super-resolution (STISR) has recently achieved great success as a preprocessing method for scene text recognition.
In this study, we leverage text-conditional diffusion models (DMs) for STISR tasks.
We propose a novel framework for LR-HR paired text image datasets.
arXiv Detail & Related papers (2023-11-16T10:32:18Z) - TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution [17.95994419104427]
TextDiff is a diffusion-based framework tailored for scene text image super-resolution.<n>It achieves state-of-the-art (SOTA) performance on public benchmark datasets.<n>Our proposed MRD module is plug-and-play that effectively sharpens the text edges produced by SOTA methods.
arXiv Detail & Related papers (2023-08-13T11:02:16Z) - Towards Robust Scene Text Image Super-resolution via Explicit Location
Enhancement [59.66539728681453]
Scene text image super-resolution (STISR) aims to improve image quality while boosting downstream scene text recognition accuracy.
Most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process.
We propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution.
arXiv Detail & Related papers (2023-07-19T05:08:47Z) - Text Prior Guided Scene Text Image Super-resolution [11.396781380648756]
Scene text image super-resolution (STISR) aims to improve the resolution and visual quality of low-resolution (LR) scene text images.
We make an attempt to embed categorical text prior into STISR model training.
We present a multi-stage text prior guided super-resolution framework for STISR.
arXiv Detail & Related papers (2021-06-29T12:52:33Z) - Scene Text Image Super-Resolution in the Wild [112.90416737357141]
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones.
Previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images.
We pro-pose a real scene text SR dataset, termed TextZoom.
It contains paired real low-resolution and high-resolution images captured by cameras with different focal length in the wild.
arXiv Detail & Related papers (2020-05-07T09:18:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.