Type-R: Automatically Retouching Typos for Text-to-Image Generation
- URL: http://arxiv.org/abs/2411.18159v2
- Date: Wed, 30 Apr 2025 10:24:23 GMT
- Title: Type-R: Automatically Retouching Typos for Text-to-Image Generation
- Authors: Wataru Shimoda, Naoto Inoue, Daichi Haraguchi, Hayato Mitani, Seiichi Uchida, Kota Yamaguchi
- Abstract summary: We propose to retouch erroneous text renderings in the post-processing pipeline. Our approach, called Type-R, identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and finally corrects typos in the rendered words.
- Score: 14.904165023640854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While recent text-to-image models can generate photorealistic images from text prompts that reflect detailed instructions, they still face significant challenges in accurately rendering words in the image. In this paper, we propose to retouch erroneous text renderings in the post-processing pipeline. Our approach, called Type-R, identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and finally corrects typos in the rendered words. Through extensive experiments, we show that Type-R, in combination with the latest text-to-image models such as Stable Diffusion or Flux, achieves the highest text rendering accuracy while maintaining image quality and also outperforms text-focused generation baselines in terms of balancing text accuracy and image quality.
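As a minimal sketch of the four-stage control flow described in the abstract, the Python below wires the stages together; `erase_regions`, `layout_boxes`, and `render_word` are hypothetical placeholders for the paper's erasure, layout-generation, and text-rendering models, reduced to no-ops so the sketch runs.
```python
# Minimal sketch of a Type-R-style retouching loop; only the control flow
# follows the abstract. The three helpers are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class TextBox:
    text: str    # word to render, or word read by OCR
    bbox: tuple  # (x, y, w, h) in image coordinates

def erase_regions(image, boxes):
    return image  # placeholder: text-erasure inpainting

def layout_boxes(image, words):
    return [TextBox(w, (0, 0, 10, 10)) for w in words]  # placeholder layout

def render_word(image, box):
    return image  # placeholder: typo-free text rendering

def type_r(image, ocr_boxes, prompt_words):
    wanted = {w.lower() for w in prompt_words}
    # Stage 1: identify typographical errors (rendered words not in the prompt).
    typos = [b for b in ocr_boxes if b.text.lower() not in wanted]
    # Stage 2: erase the erroneous renderings.
    image = erase_regions(image, typos)
    # Stage 3: regenerate text boxes for prompt words with no correct rendering.
    correct = {b.text.lower() for b in ocr_boxes} & wanted
    missing = [w for w in prompt_words if w.lower() not in correct]
    # Stage 4: render the corrected words into the regenerated boxes.
    for box in layout_boxes(image, missing):
        image = render_word(image, box)
    return image

retouched = type_r(None, [TextBox("Helo", (4, 2, 40, 12))], ["Hello"])
```
A consequence of this decomposition is that correctly rendered words are never touched, which is consistent with the abstract's claim of fixing text while maintaining image quality.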
Related papers
- TextSR: Diffusion Super-Resolution with Multilingual OCR Guidance [24.242452422416438]
We introduce TextSR, a multimodal diffusion model specifically designed for Multilingual Text Image Super-Resolution. By integrating text character priors with the low-resolution text images, our model effectively guides the super-resolution process. The superior performance of our model on both the TextZoom and TextVQA datasets sets a new benchmark for STISR.
arXiv Detail & Related papers (2025-05-29T05:40:35Z)
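As a rough illustration of how a character prior could be fused with a low-resolution input, the sketch below broadcasts a pooled character embedding over the image grid; the layer sizes, names, and fusion scheme are assumptions for illustration, not the TextSR architecture.
```python
# Hypothetical fusion of a text character prior with a low-resolution image.
import torch
import torch.nn as nn

class CharPriorSR(nn.Module):
    def __init__(self, vocab_size=97, d=64):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, d)    # per-character prior
        self.img_proj = nn.Conv2d(3, d, 3, padding=1)  # LR image features
        self.fuse = nn.Conv2d(2 * d, d, 1)             # concatenate and mix
        self.head = nn.Conv2d(d, 3, 3, padding=1)      # toy SR output

    def forward(self, lr_image, char_ids):
        b, _, h, w = lr_image.shape
        img = self.img_proj(lr_image)
        # Pool the character embeddings and broadcast over the pixel grid:
        # one simple way to let every location see the recognized text.
        prior = self.char_emb(char_ids).mean(dim=1)
        prior = prior[:, :, None, None].expand(b, -1, h, w)
        return self.head(self.fuse(torch.cat([img, prior], dim=1)))

model = CharPriorSR()
sr = model(torch.randn(2, 3, 16, 64), torch.randint(0, 97, (2, 12)))
print(sr.shape)  # torch.Size([2, 3, 16, 64])
```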
- Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models [76.68654868991517]
Long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models.
We introduce a novel text-focused, binary tokenizer optimized for capturing detailed scene text features.
We develop ModelName, a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity.
arXiv Detail & Related papers (2025-03-26T03:44:25Z)
- TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images.
We introduce TextInVision, a large-scale, text and prompt complexity driven benchmark to evaluate the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z)
- Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering [59.088036977605405]
Visual text rendering poses a fundamental challenge for text-to-image generation models.
We craft a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder.
We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the creation of the Glyph-SDXL model for design image generation.
arXiv Detail & Related papers (2024-03-14T17:55:33Z)
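Glyph-ByT5 starts from the off-the-shelf character-aware ByT5 encoder, so a minimal sketch can show just that starting point: extracting per-byte (effectively per-character) text features with the stock Hugging Face model. The glyph-specific fine-tuning and the SDXL cross-attention integration are omitted.
```python
# Character-aware text features from the stock ByT5 encoder, the starting
# point that Glyph-ByT5 fine-tunes.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tok = AutoTokenizer.from_pretrained("google/byt5-small")
enc = T5EncoderModel.from_pretrained("google/byt5-small")

# ByT5 tokenizes at the byte level, so each character of the target string
# gets its own embedding -- the property that helps spelling-exact rendering.
batch = tok(["Grand OPENING", "50% off"], return_tensors="pt", padding=True)
with torch.no_grad():
    feats = enc(**batch).last_hidden_state  # (batch, num_bytes, hidden)
print(feats.shape)
```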
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
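A light-weight character-level text encoder of the kind the UDiffText summary describes might take the following shape; the vocabulary, width, and depth are illustrative guesses, not the paper's configuration.
```python
# Hypothetical light-weight character-level text encoder.
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    def __init__(self, vocab=128, d=256, heads=4, layers=2, max_len=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)               # one row per character
        self.pos = nn.Parameter(torch.zeros(1, max_len, d))
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, char_ids):                        # (batch, seq_len)
        x = self.emb(char_ids) + self.pos[:, : char_ids.size(1)]
        return self.encoder(x)                          # embedding per character

ids = torch.tensor([[ord(c) for c in "SALE"]])          # ASCII codes as ids
print(CharEncoder()(ids).shape)                         # torch.Size([1, 4, 256])
```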
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, the text images we produce consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- Image Super-Resolution with Text Prompt Diffusion [118.023531454099]
We introduce text prompts to image SR to provide degradation priors.
PromptSR utilizes a pre-trained language model (e.g., T5 or CLIP) to enhance restoration.
Experiments indicate that introducing text prompts into SR yields excellent results on both synthetic and real-world images.
arXiv Detail & Related papers (2023-11-24T05:11:35Z)
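The degradation-prior-as-text idea can be sketched in a few lines: describe the degradation in words, then encode the description with a pre-trained language model. The prompt template below and the choice of the CLIP text encoder are assumptions for illustration.
```python
# Degradation prior expressed as a text prompt and encoded with a
# pre-trained language model; the template and model choice are assumptions.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

def degradation_prompt(blur: float, noise: float, downscale: int) -> str:
    return (f"a low-resolution photo, downsampled {downscale}x, "
            f"with gaussian blur {blur:.1f} and noise level {noise:.2f}")

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
txt = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = degradation_prompt(blur=2.0, noise=0.05, downscale=4)
with torch.no_grad():
    cond = txt(**tok([prompt], return_tensors="pt")).last_hidden_state
print(cond.shape)  # conditioning tokens for the restoration network
```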
- Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution [15.391125077873745]
Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and legibility of text within low-resolution (LR) images.
Previous methods predominantly employ discriminative Convolutional Neural Networks (CNNs) augmented with diverse forms of text guidance.
We introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image Super-Resolution, which exhibits great generative diversity and fidelity even in challenging scenarios.
arXiv Detail & Related papers (2023-11-22T11:10:45Z)
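One plausible reading of recognition guidance, sketched with toy stand-ins below, is classifier-style guidance: nudge each reverse-diffusion step along the gradient that makes a recognizer read the target text. RGDiffSR's actual mechanism may differ.
```python
# Classifier-guidance-style "recognition guidance"; toy stand-ins only.
import torch

def guided_step(denoiser, recognizer, x_t, t, target_ids, scale=1.0):
    """One reverse-diffusion step nudged toward images whose text the
    recognizer reads as target_ids."""
    x_t = x_t.detach().requires_grad_(True)
    log_probs = recognizer(x_t).log_softmax(dim=-1)     # (b, seq, vocab)
    score = log_probs.gather(-1, target_ids[..., None]).sum()
    grad = torch.autograd.grad(score, x_t)[0]           # d log p(text|x) / d x
    with torch.no_grad():
        return denoiser(x_t, t) + scale * grad          # guided update

class ToyRecognizer(torch.nn.Module):
    def __init__(self, seq=4, vocab=10):
        super().__init__()
        self.proj = torch.nn.Linear(3 * 32 * 128, seq * vocab)
        self.seq, self.vocab = seq, vocab
    def forward(self, x):
        return self.proj(x.flatten(1)).view(-1, self.seq, self.vocab)

x = torch.randn(1, 3, 32, 128)
ids = torch.randint(0, 10, (1, 4))
x_prev = guided_step(lambda x, t: 0.9 * x, ToyRecognizer(), x, 0, ids)
```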
- TextDiffuser: Diffusion Models as Text Painters [118.30923824681642]
We introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds.
We contribute the first large-scale text image dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs.
We show that TextDiffuser is flexible and controllable, creating high-quality text images from text prompts alone or together with text template images, and performing text inpainting to reconstruct incomplete images with text.
arXiv Detail & Related papers (2023-05-18T10:16:19Z)
- Semantic-Preserving Augmentation for Robust Image-Text Retrieval [27.2916415148638]
RVSE consists of novel image-based and text-based augmentation techniques, called semantic-preserving augmentation for image (SPAugI) and text (SPAugT).
Since SPAugI and SPAugT change the original data in a way that preserves its semantic information, we encourage the feature extractors to generate semantics-aware embedding vectors.
From extensive experiments using benchmark datasets, we show that RVSE outperforms conventional retrieval schemes in terms of image-text retrieval performance.
arXiv Detail & Related papers (2023-03-10T03:50:44Z)
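The interface of such augmentations is easy to sketch even though the paper's concrete SPAugI and SPAugT operations are not reproduced here; both stand-ins below are purely illustrative assumptions that change surface form while preserving meaning.
```python
# Purely illustrative semantic-preserving augmentations, not RVSE's own.
import random

SYNONYMS = {"photo": "picture", "small": "little", "big": "large"}

def spaug_t(caption: str) -> str:
    """Swap words for hand-picked synonyms: form changes, meaning does not."""
    return " ".join(SYNONYMS.get(w, w) if random.random() < 0.5 else w
                    for w in caption.split())

def spaug_i(pixels):
    """Brightness jitter: appearance changes, image content does not."""
    gain = random.uniform(0.9, 1.1)
    return [min(255, int(p * gain)) for p in pixels]

random.seed(0)
print(spaug_t("a small photo of a big dog"))
print(spaug_i([10, 128, 250]))
```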
- Text Prior Guided Scene Text Image Super-resolution [11.396781380648756]
Scene text image super-resolution (STISR) aims to improve the resolution and visual quality of low-resolution (LR) scene text images.
We attempt to embed a categorical text prior into STISR model training.
We present a multi-stage text prior guided super-resolution framework for STISR.
arXiv Detail & Related papers (2021-06-29T12:52:33Z)
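One natural reading of the multi-stage framework, sketched with placeholder models below, is a loop in which the recognizer's categorical prior and the super-resolved image refine each other; this interpretation, and both stand-in models, are assumptions.
```python
# Multi-stage text-prior-guided SR loop; recognizer and sr_model are
# hypothetical placeholders so the sketch runs end to end.
def tp_stisr(lr_image, recognizer, sr_model, stages=3):
    image = lr_image
    for _ in range(stages):
        prior = recognizer(image)       # categorical text prior (per-char probs)
        image = sr_model(image, prior)  # SR step conditioned on the prior
    return image

fake_recognizer = lambda img: [1.0 / 26] * 26    # uniform class distribution
fake_sr = lambda img, prior: img                 # identity "super-resolver"
print(tp_stisr("lr_image_stub", fake_recognizer, fake_sr))
```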