On Manipulating Scene Text in the Wild with Diffusion Models
- URL: http://arxiv.org/abs/2311.00734v2
- Date: Fri, 3 Nov 2023 10:11:52 GMT
- Title: On Manipulating Scene Text in the Wild with Diffusion Models
- Authors: Joshua Santoso, Christian Simon, Williem Pao
- Abstract summary: We introduce the Diffusion-BasEd Scene Text manipulation network (DBEST).
Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance.
Our method achieves 94.15% and 98.12% character-level OCR accuracy on the COCO-Text and ICDAR2013 datasets, respectively.
- Score: 4.034781390227754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have gained attention for image editing, yielding impressive
results in text-to-image tasks. On the downside, images generated by Stable Diffusion
models often suffer from deteriorated details. This pitfall impacts image editing tasks
that require information preservation, e.g., scene text editing. As a desired result,
the model must be able to replace the text in the source image with the target text
while preserving details such as color, font size, and background. To leverage the
potential of diffusion models, in this work we introduce the Diffusion-BasEd Scene
Text manipulation network (DBEST). Specifically, we design two adaptation strategies,
namely one-shot style adaptation and text-recognition guidance. In experiments, we
thoroughly assess and compare our proposed method against state-of-the-art methods on
various scene text datasets, then provide extensive ablation studies at each granularity
to analyze our performance gain. We also demonstrate the effectiveness of our method
for synthesizing scene text, as indicated by competitive Optical Character Recognition
(OCR) accuracy. Our method achieves 94.15% and 98.12% character-level accuracy on the
COCO-Text and ICDAR2013 datasets, respectively.
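The abstract names text-recognition guidance as one of the two adaptation strategies but does not spell it out. The sketch below shows one common way such guidance is realized in diffusion samplers: at each denoising step, the current clean-image estimate is nudged along the negative gradient of an OCR loss so the synthesized text reads as the target string. The DDIM-style update and the `denoiser`/`ocr_model` names and signatures are illustrative assumptions, not the paper's DBEST implementation.

```python
import torch
import torch.nn.functional as F


def sample_with_ocr_guidance(denoiser, ocr_model, target_ids, x_T,
                             alphas_cumprod, timesteps, guidance_scale=1.0):
    """Hypothetical recognition-guided DDIM-style sampler (not the DBEST code).

    denoiser(x_t, t)   -> predicted noise, same shape as x_t
    ocr_model(image)   -> per-character logits of shape (B, L, vocab)
    target_ids         -> target character indices of shape (B, L)
    alphas_cumprod[t]  -> cumulative noise-schedule product at timestep t
    timesteps          -> decreasing list of integer timesteps
    """
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        eps = denoiser(x, t)
        # Clean-image estimate implied by the current noisy sample.
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()

        # Text-recognition guidance: push x0_hat toward images whose rendered
        # text the frozen OCR model reads as `target_ids`.
        x0_req = x0_hat.detach().requires_grad_(True)
        logits = ocr_model(x0_req)
        loss = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
        grad = torch.autograd.grad(loss, x0_req)[0]
        x0_hat = x0_hat - guidance_scale * grad

        # Deterministic DDIM update to the next (earlier) timestep.
        a_prev = (alphas_cumprod[timesteps[i + 1]]
                  if i + 1 < len(timesteps) else torch.tensor(1.0))
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
    return x
```

In a sketch like this, the guidance scale trades off recognition accuracy against preservation of style and background; how DBEST balances these is detailed in the paper itself.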
Related papers
- Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models.
A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies.
Our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models.
arXiv Detail & Related papers (2024-11-08T17:07:01Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method effectively learns prompts that improve the match between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model [31.819060415422353]
Diff-Text is a training-free scene text generation framework for any language.
Our method outperforms existing methods in both text recognition accuracy and the naturalness of foreground-background blending.
arXiv Detail & Related papers (2023-12-19T15:18:40Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves designing and training a lightweight character-level text encoder that replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions.
We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions.
We achieve visual results of quality comparable to models specifically trained with layout conditions.
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
- DiffusionSTR: Diffusion Model for Scene Text Recognition [0.0]
Diffusion Model for Scene Text Recognition (DiffusionSTR) is an end-to-end text recognition framework.
We show for the first time that diffusion models can be applied to text recognition.
arXiv Detail & Related papers (2023-06-29T06:09:32Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)