Improving Diffusion Models for Scene Text Editing with Dual Encoders
- URL: http://arxiv.org/abs/2304.05568v1
- Date: Wed, 12 Apr 2023 02:08:34 GMT
- Title: Improving Diffusion Models for Scene Text Editing with Dual Encoders
- Authors: Jiabao Ji, Guanhua Zhang, Zhaowen Wang, Bairu Hou, Zhifei Zhang, Brian
Price, Shiyu Chang
- Abstract summary: Scene text editing is a challenging task that involves modifying or inserting specified texts in an image.
Recent advances in diffusion models have shown promise in overcoming these limitations with text-conditional image editing.
We propose DIFFSTE to improve pre-trained diffusion models with a dual encoder design.
- Score: 44.12999932588205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene text editing is a challenging task that involves modifying or inserting
specified texts in an image while maintaining its natural and realistic
appearance. Most previous approaches to this task rely on style-transfer models
that crop out text regions and feed them into image transfer models, such as
GANs. However, these methods are limited in their ability to change text style
and are unable to insert texts into images. Recent advances in diffusion models
have shown promise in overcoming these limitations with text-conditional image
editing. However, our empirical analysis reveals that state-of-the-art
diffusion models struggle with rendering correct text and controlling text
style. To address these problems, we propose DIFFSTE to improve pre-trained
diffusion models with a dual encoder design, which includes a character encoder
for better text legibility and an instruction encoder for better style control.
An instruction tuning framework is introduced to train our model to learn the
mapping from the text instruction to the corresponding image with either the
specified style or the style of the surrounding texts in the background. Such a
training method further brings our method the zero-shot generalization ability
to the following three scenarios: generating text with unseen font variation,
e.g., italic and bold, mixing different fonts to construct a new font, and
using more relaxed forms of natural language as the instructions to guide the
generation task. We evaluate our approach on five datasets and demonstrate its
superior performance in terms of text correctness, image naturalness, and style
controllability. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffSTE.
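The dual encoder design described in the abstract can be made concrete with a small sketch. The PyTorch-style code below is only an illustration of the stated idea, not the released DiffSTE implementation (see the repository linked above): the target word is encoded character by character for legibility, the natural-language instruction is encoded separately for style, and the two token sequences are concatenated into one cross-attention context for the denoising UNet. All module and parameter names here (CharEncoder, DualEncoderConditioner, embed_dim, max_len) are hypothetical.

```python
# Illustrative sketch only, NOT the released DiffSTE implementation (see the repository above).
# Assumed idea: a character-level branch for legibility and an instruction branch for style,
# whose token sequences are concatenated into one cross-attention context for a diffusion UNet.
import string

import torch
import torch.nn as nn


class CharEncoder(nn.Module):
    """Character-level encoder: one embedding per character plus a small Transformer."""

    def __init__(self, vocab: str = string.printable, embed_dim: int = 256, max_len: int = 32):
        super().__init__()
        self.stoi = {c: i + 1 for i, c in enumerate(vocab)}  # index 0 is reserved for padding
        self.embed = nn.Embedding(len(vocab) + 1, embed_dim, padding_idx=0)
        self.pos = nn.Parameter(torch.zeros(1, max_len, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.max_len = max_len

    def forward(self, texts: list[str]) -> torch.Tensor:
        ids = torch.zeros(len(texts), self.max_len, dtype=torch.long)
        for b, text in enumerate(texts):
            for i, ch in enumerate(text[: self.max_len]):
                ids[b, i] = self.stoi.get(ch, 0)
        return self.encoder(self.embed(ids) + self.pos)  # (batch, max_len, embed_dim)


class DualEncoderConditioner(nn.Module):
    """Concatenates character tokens and instruction tokens into a single UNet context."""

    def __init__(self, instruction_encoder: nn.Module, embed_dim: int = 256):
        super().__init__()
        self.char_encoder = CharEncoder(embed_dim=embed_dim)
        self.instruction_encoder = instruction_encoder  # e.g. a (projected) frozen text encoder
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, target_texts: list[str], instructions: list[str]) -> torch.Tensor:
        char_tokens = self.proj(self.char_encoder(target_texts))  # (B, Lc, D)
        instr_tokens = self.instruction_encoder(instructions)     # (B, Li, D), assumed shape
        return torch.cat([char_tokens, instr_tokens], dim=1)      # cross-attention context
```

With a stand-in instruction encoder that returns (batch, length, embed_dim) tokens, calling conditioner(["HELLO"], ["render the word in bold italic"]) would yield a single context tensor that could be fed to a conditional diffusion UNet as its cross-attention input.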
Related papers
- TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control [5.3798706094384725]
We propose TextCtrl, a diffusion-based method that edits text with prior guidance control.
By constructing fine-grained text style disentanglement and a robust text structure representation, TextCtrl explicitly incorporates Style-Structure guidance into model design and network training, significantly improving text style consistency and rendering accuracy.
To fill the vacancy of a real-world STE evaluation benchmark, we create the first real-world image-pair dataset, termed ScenePair, for fair comparisons.
arXiv Detail & Related papers (2024-10-14T03:50:39Z) - TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles [12.182588762414058]
Scene text editing aims to modify texts on images while maintaining the style of newly generated text similar to the original.
Recent works leverage diffusion models, showing improved results, yet still face challenges.
We present emphTextMastero - a carefully designed multilingual scene text editing architecture based on latent diffusion models (LDMs)
arXiv Detail & Related papers (2024-08-20T08:06:09Z) - UDiffText: A Unified Framework for High-quality Text Synthesis in
Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z) - TextDiffuser-2: Unleashing the Power of Language Models for Text
Rendering [118.30923824681642]
TextDiffuser-2 aims to unleash the power of language models for text rendering.
We utilize the language model within the diffusion model to encode the position and texts at the line level.
We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V.
arXiv Detail & Related papers (2023-11-28T04:02:40Z) - ControlStyle: Text-Driven Stylized Image Generation Using Diffusion
Priors [105.37795139586075]
We propose a new task for stylizing'' text-to-image models, namely text-driven stylized image generation.
We present a new diffusion model (ControlStyle) via upgrading a pre-trained text-to-image model with a trainable modulation network.
Experiments demonstrate the effectiveness of our ControlStyle in producing more visually pleasing and artistic results.
arXiv Detail & Related papers (2023-11-09T15:50:52Z) - Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z) - Exploring Stroke-Level Modifications for Scene Text Editing [86.33216648792964]
Scene text editing (STE) aims to replace text with the desired one while preserving background and styles of the original text.
Previous methods of editing the whole image have to learn different translation rules of background and text regions simultaneously.
We propose a novel network by MOdifying Scene Text image at strokE Level (MOSTEL)
arXiv Detail & Related papers (2022-12-05T02:10:59Z) - eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert
Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z) - Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)