On Manipulating Scene Text in the Wild with Diffusion Models
- URL: http://arxiv.org/abs/2311.00734v2
- Date: Fri, 3 Nov 2023 10:11:52 GMT
- Title: On Manipulating Scene Text in the Wild with Diffusion Models
- Authors: Joshua Santoso, Christian Simon, Williem Pao
- Abstract summary: We introduce the Diffusion-BasEd Scene Text manipulation network (DBEST).
Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance.
Our method achieves 94.15% and 98.12% character-level OCR accuracy on the COCO-Text and ICDAR2013 datasets, respectively.
- Score: 4.034781390227754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have gained attention for image editing, yielding impressive
results in text-to-image tasks. On the downside, images generated by Stable Diffusion
models often suffer from deteriorated details. This pitfall impacts image editing tasks
that require information preservation, e.g., scene text editing. As a desired result,
the model must be able to replace the text in the source image with the target text
while preserving details such as color, font size, and background. To leverage the
potential of diffusion models, in this work we introduce the Diffusion-BasEd Scene
Text manipulation network (DBEST). Specifically, we design two adaptation strategies,
namely one-shot style adaptation and text-recognition guidance. In experiments, we
thoroughly assess and compare our proposed method against state-of-the-art methods on
various scene text datasets, then provide extensive ablation studies at each granularity
to analyze our performance gain. We also demonstrate the effectiveness of our method
for synthesizing scene text, as indicated by competitive Optical Character Recognition
(OCR) accuracy. Our method achieves 94.15% and 98.12% character-level accuracy on the
COCO-Text and ICDAR2013 datasets, respectively.
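The abstract names text-recognition guidance as one of the two adaptation strategies but does not spell it out. The sketch below shows one common way such guidance is realized in diffusion samplers: at each denoising step, the current clean-image estimate is nudged along the negative gradient of an OCR loss so the synthesized text reads as the target string. The DDIM-style update and the `denoiser`/`ocr_model` names and signatures are illustrative assumptions, not the paper's DBEST implementation.

```python
import torch
import torch.nn.functional as F


def sample_with_ocr_guidance(denoiser, ocr_model, target_ids, x_T,
                             alphas_cumprod, timesteps, guidance_scale=1.0):
    """Hypothetical recognition-guided DDIM-style sampler (not the DBEST code).

    denoiser(x_t, t)   -> predicted noise, same shape as x_t
    ocr_model(image)   -> per-character logits of shape (B, L, vocab)
    target_ids         -> target character indices of shape (B, L)
    alphas_cumprod[t]  -> cumulative noise-schedule product at timestep t
    timesteps          -> decreasing list of integer timesteps
    """
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        eps = denoiser(x, t)
        # Clean-image estimate implied by the current noisy sample.
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()

        # Text-recognition guidance: push x0_hat toward images whose rendered
        # text the frozen OCR model reads as `target_ids`.
        x0_req = x0_hat.detach().requires_grad_(True)
        logits = ocr_model(x0_req)
        loss = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten())
        grad = torch.autograd.grad(loss, x0_req)[0]
        x0_hat = x0_hat - guidance_scale * grad

        # Deterministic DDIM update to the next (earlier) timestep.
        a_prev = (alphas_cumprod[timesteps[i + 1]]
                  if i + 1 < len(timesteps) else torch.tensor(1.0))
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
    return x
```

In a sketch like this, the guidance scale trades off recognition accuracy against preservation of style and background; how DBEST balances these is detailed in the paper itself.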
Related papers
- Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models [16.00576040281808]
We propose a novel framework called Image2Text2Image to evaluate image captioning models.
A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies.
Our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models.
arXiv Detail & Related papers (2024-11-08T17:07:01Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method effectively learns prompts that improve the match between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model [31.819060415422353]
Diff-Text is a training-free scene text generation framework for any language.
Our method outperforms existing methods in both text recognition accuracy and the naturalness of foreground-background blending.
arXiv Detail & Related papers (2023-12-19T15:18:40Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves designing and training a lightweight character-level text encoder that replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions.
We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions.
We achieve visual results of quality comparable to models specifically trained with layout conditions.
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
- DiffusionSTR: Diffusion Model for Scene Text Recognition [0.0]
Diffusion Model for Scene Text Recognition (DiffusionSTR) is an end-to-end text recognition framework.
We show for the first time that diffusion models can be applied to text recognition.
arXiv Detail & Related papers (2023-06-29T06:09:32Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)