DiffUTE: Universal Text Editing Diffusion Model
- URL: http://arxiv.org/abs/2305.10825v3
- Date: Wed, 18 Oct 2023 05:43:19 GMT
- Title: DiffUTE: Universal Text Editing Diffusion Model
- Authors: Haoxing Chen and Zhuoer Xu and Zhangxuan Gu and Jun Lan and Xing Zheng
and Yaohui Li and Changhua Meng and Huijia Zhu and Weiqiang Wang
- Abstract summary: We propose a universal self-supervised text editing diffusion model (DiffUTE).
It aims to replace or modify words in a source image with new ones while maintaining a realistic appearance.
Our method achieves impressive performance and enables controllable editing on in-the-wild images with high fidelity.
- Score: 32.384236053455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion model based language-guided image editing has achieved great
success recently. However, existing state-of-the-art diffusion models struggle
with rendering correct text and text style during generation. To tackle this
problem, we propose a universal self-supervised text editing diffusion model
(DiffUTE), which aims to replace or modify words in the source image with
new ones while maintaining a realistic appearance. Specifically, we build
our model on a diffusion model and carefully modify the network structure so
that the model can draw multilingual characters with the help of glyph and
position information. Moreover, we design a self-supervised learning framework
to leverage large amounts of web data to improve the representation ability of
the model. Experimental results show that our method achieves impressive
performance and enables controllable editing on in-the-wild images with high
fidelity. Our code will be available at
\url{https://github.com/chenhaoxing/DiffUTE}.
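The abstract describes conditioning the diffusion model on glyph and position information so that it can draw the target characters in the right place. The snippet below is a minimal, hypothetical sketch of how such conditioning could be wired in: a rendered glyph image and a binary position mask are encoded and fused with the noisy latent before it enters a UNet-style denoiser. Module names, channel counts, and shapes are illustrative assumptions, not details taken from the DiffUTE code.

```python
import torch
import torch.nn as nn

class GlyphPositionConditioner(nn.Module):
    """Hypothetical conditioning module: encodes a rendered glyph image and a
    binary position mask, then fuses them with the noisy latent so a UNet-style
    denoiser can see where and what text to draw. Illustrative sketch only,
    not the DiffUTE implementation."""

    def __init__(self, latent_channels: int = 4, cond_channels: int = 8):
        super().__init__()
        # Small conv encoders for the glyph render (target characters) and the
        # position mask (1 inside the text box, 0 elsewhere).
        self.glyph_encoder = nn.Sequential(
            nn.Conv2d(1, cond_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(cond_channels, cond_channels, 3, padding=1),
        )
        self.mask_encoder = nn.Conv2d(1, cond_channels, 3, padding=1)
        # Project the concatenated stack back to the channel count the
        # denoiser expects as input.
        self.fuse = nn.Conv2d(latent_channels + 2 * cond_channels,
                              latent_channels, 1)

    def forward(self, noisy_latent, glyph_image, position_mask):
        g = self.glyph_encoder(glyph_image)
        m = self.mask_encoder(position_mask)
        return self.fuse(torch.cat([noisy_latent, g, m], dim=1))


if __name__ == "__main__":
    cond = GlyphPositionConditioner()
    latent = torch.randn(2, 4, 64, 64)   # noisy VAE latent
    glyph = torch.randn(2, 1, 64, 64)    # rendered target text
    mask = torch.zeros(2, 1, 64, 64)     # 1 where text should appear
    mask[:, :, 20:40, 10:50] = 1.0
    out = cond(latent, glyph, mask)
    print(out.shape)  # torch.Size([2, 4, 64, 64]) -> fed to the denoiser
```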
Related papers
- Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores using additional conditions of an image that provides visual guidance of the particular subjects for diffusion models to generate.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins outperform existing methods on all tasks, with each plugin containing only 28.55M trainable parameters.
arXiv Detail & Related papers (2024-11-22T21:38:51Z)
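The Conditional Text-to-Image Generation with Reference Guidance entry describes small expert plugins (about 28.55M trainable parameters each) that let a frozen Stable Diffusion model consume different kinds of reference inputs. A common way to realize such a plugin is a bottleneck adapter whose output is added to a frozen backbone feature; the sketch below shows only that generic pattern, with hypothetical dimensions, and is not the paper's actual plugin design.

```python
import torch
import torch.nn as nn

class ReferenceAdapter(nn.Module):
    """Generic bottleneck adapter: projects reference features down and back
    up, then adds the result to a frozen backbone feature map. Illustrates the
    'small trainable plugin on a frozen diffusion model' idea only; the actual
    plugin design in the paper may differ."""

    def __init__(self, feat_dim: int = 1280, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(feat_dim, bottleneck)
        self.act = nn.SiLU()
        self.up = nn.Linear(bottleneck, feat_dim)
        nn.init.zeros_(self.up.weight)   # start as identity so the frozen
        nn.init.zeros_(self.up.bias)     # model's behavior is unchanged

    def forward(self, backbone_feat, reference_feat):
        return backbone_feat + self.up(self.act(self.down(reference_feat)))


if __name__ == "__main__":
    adapter = ReferenceAdapter()
    n_params = sum(p.numel() for p in adapter.parameters())
    print(f"trainable parameters in this toy adapter: {n_params:,}")
    h = torch.randn(2, 77, 1280)   # frozen backbone features
    r = torch.randn(2, 77, 1280)   # encoded reference (e.g. a glyph crop)
    print(adapter(h, r).shape)
```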
- ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models [52.23899502520261]
We introduce a new framework named ARTIST to focus on the learning of text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
arXiv Detail & Related papers (2024-06-17T19:31:24Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
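UDiffText replaces the original CLIP text encoder with a light-weight character-level text encoder. The sketch below shows the general shape such an encoder might take: a character embedding table plus a small Transformer that produces per-character tokens for the denoiser to cross-attend to. Vocabulary size, dimensions, and depth are illustrative assumptions rather than the paper's values.

```python
import torch
import torch.nn as nn

class CharTextEncoder(nn.Module):
    """Toy character-level text encoder: embeds each character and
    contextualizes the sequence with a small Transformer, producing
    per-character embeddings a diffusion UNet could cross-attend to.
    Sizes are illustrative only."""

    def __init__(self, vocab_size: int = 256, dim: int = 320,
                 n_layers: int = 4, max_len: int = 32):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer codes of the target characters
        pos = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.char_embed(char_ids) + self.pos_embed(pos)
        return self.encoder(x)   # (batch, seq_len, dim) conditioning tokens


if __name__ == "__main__":
    enc = CharTextEncoder()
    ids = torch.tensor([[ord(c) % 256 for c in "DiffUTE demo".ljust(16)]])
    print(enc(ids).shape)  # torch.Size([1, 16, 320])
```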
- Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis [47.27044390204868]
We introduce a novel approach to improving T2I diffusion models using Large Language Models (LLMs) as layout generators.
Our experiments demonstrate significant improvements in image quality and layout accuracy.
arXiv Detail & Related papers (2023-11-28T14:51:13Z)
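Reason out Your Layout uses an LLM as a layout generator for text-to-image synthesis. A minimal version of the idea is to prompt the LLM for object bounding boxes and hand those boxes to a layout-conditioned diffusion model; the sketch below covers only the prompting and parsing half. The prompt template, the call_llm callable, and the box schema are hypothetical stand-ins, not the paper's interface.

```python
import json

LAYOUT_PROMPT = """You are a layout planner. Given an image caption, output a JSON
list of objects, each with "name" and "box" as [x0, y0, x1, y1] on a 512x512 canvas.
Caption: {caption}
JSON:"""

def plan_layout(caption: str, call_llm) -> list[dict]:
    """Ask an LLM (via a user-supplied `call_llm(prompt) -> str` function) for a
    bounding-box layout, then parse it. Purely illustrative; the paper's actual
    prompting scheme may differ."""
    raw = call_llm(LAYOUT_PROMPT.format(caption=caption))
    boxes = json.loads(raw)
    # The boxes would next be rasterized into masks or fed to a
    # layout-conditioned diffusion model.
    return boxes

if __name__ == "__main__":
    # Stub LLM so the example runs without any API access.
    fake_llm = lambda prompt: json.dumps([
        {"name": "a red apple", "box": [60, 300, 200, 450]},
        {"name": "a glass bottle", "box": [280, 120, 400, 460]},
    ])
    print(plan_layout("a red apple next to a glass bottle", fake_llm))
```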
- DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models [66.43179841884098]
We propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models.
Our method achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging.
arXiv Detail & Related papers (2023-07-05T16:43:56Z)
- In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
- PRedItOR: Text Guided Image Editing with Diffusion Prior [2.3022070933226217]
Text guided image editing typically requires compute-intensive optimization of text embeddings or fine-tuning of the model weights.
Our architecture consists of a diffusion prior model that generates CLIP image embedding conditioned on a text prompt and a custom Latent Diffusion Model trained to generate images conditioned on CLIP image embedding.
We combine this with structure preserving edits on the image decoder using existing approaches such as reverse DDIM to perform text guided image editing.
arXiv Detail & Related papers (2023-02-15T22:58:11Z)
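PRedItOR chains a diffusion prior (text prompt to CLIP image embedding) with a latent diffusion decoder conditioned on that embedding, plus a structure-preserving step such as reverse DDIM on the source image. The pseudocode below sketches that flow under assumed interfaces; prior, ldm_decoder, and ddim_invert are placeholder callables, not real library APIs.

```python
import torch

def prior_guided_edit(source_image: torch.Tensor, edit_prompt: str,
                      prior, ldm_decoder, ddim_invert):
    """Illustrative prior-then-decoder editing flow. All three callables are
    assumed stand-ins: `prior(text) -> CLIP image embedding`,
    `ddim_invert(image) -> structure-preserving starting latent`,
    `ldm_decoder(latent, image_embed) -> edited image`."""
    # 1. Diffusion prior: text prompt -> CLIP image embedding.
    image_embed = prior(edit_prompt)
    # 2. Reverse DDIM on the source image to get a latent that preserves
    #    its structure when re-generated.
    start_latent = ddim_invert(source_image)
    # 3. Latent diffusion decoder conditioned on the CLIP image embedding.
    return ldm_decoder(start_latent, image_embed)

if __name__ == "__main__":
    # Tiny stubs so the sketch runs end to end.
    prior = lambda text: torch.randn(1, 768)
    ddim_invert = lambda img: torch.randn(1, 4, 64, 64)
    ldm_decoder = lambda lat, emb: torch.zeros(1, 3, 512, 512)
    out = prior_guided_edit(torch.zeros(1, 3, 512, 512), "make it snowy",
                            prior, ldm_decoder, ddim_invert)
    print(out.shape)
```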
- SINE: SINgle Image Editing with Text-to-Image Diffusion Models [10.67527134198167]
This work aims to address the problem of single-image editing.
We propose a novel model-based guidance built upon the classifier-free guidance.
We show promising editing capabilities, including changing style, content addition, and object manipulation.
arXiv Detail & Related papers (2022-12-08T18:57:13Z)
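SINE's model-based guidance builds on classifier-free guidance for single-image editing. One hedged reading, shown below, blends the noise prediction of a model fine-tuned on the single source image with that of the frozen pretrained model, then applies the usual classifier-free guidance step. The blending weight and exact combination are assumptions for illustration, not the paper's formulation.

```python
import torch

def guided_noise(eps_pretrained_uncond: torch.Tensor,
                 eps_pretrained_cond: torch.Tensor,
                 eps_finetuned_cond: torch.Tensor,
                 cfg_scale: float = 7.5,
                 model_weight: float = 0.7) -> torch.Tensor:
    """Illustrative combination of classifier-free guidance with guidance from
    a model fine-tuned on the single source image. `model_weight` trades off
    fidelity to the source image (fine-tuned model) against the edit prompt
    (pretrained model); the exact formulation in the paper may differ."""
    # Blend the two conditional noise predictions.
    eps_cond = (model_weight * eps_finetuned_cond
                + (1.0 - model_weight) * eps_pretrained_cond)
    # Standard classifier-free guidance around the unconditional prediction.
    return eps_pretrained_uncond + cfg_scale * (eps_cond - eps_pretrained_uncond)

if __name__ == "__main__":
    shape = (1, 4, 64, 64)
    out = guided_noise(torch.randn(shape), torch.randn(shape), torch.randn(shape))
    print(out.shape)
```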
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
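eDiffi trains denoisers specialized for different stages of the synthesis trajectory, and inference cost stays the same because only one expert runs at any given timestep. The sketch below illustrates that routing idea with a simple timestep threshold; the number of experts and the split point are assumptions, not eDiffi's configuration.

```python
import torch
import torch.nn as nn

class ExpertEnsembleDenoiser(nn.Module):
    """Toy 'ensemble of expert denoisers': one expert handles high-noise
    timesteps (global layout), another handles low-noise timesteps (details).
    Only the selected expert runs per step, so per-step cost matches a single
    model. Threshold and expert count are illustrative, not eDiffi's values."""

    def __init__(self, high_noise_expert: nn.Module,
                 low_noise_expert: nn.Module, split_t: int = 500):
        super().__init__()
        self.high = high_noise_expert
        self.low = low_noise_expert
        self.split_t = split_t

    def forward(self, x_t: torch.Tensor, t: int) -> torch.Tensor:
        expert = self.high if t >= self.split_t else self.low
        return expert(x_t)


if __name__ == "__main__":
    # Stand-in experts; real ones would be full text-conditioned UNets.
    def make_expert():
        return nn.Conv2d(4, 4, 3, padding=1)

    model = ExpertEnsembleDenoiser(make_expert(), make_expert())
    x = torch.randn(1, 4, 64, 64)
    print(model(x, t=900).shape, model(x, t=100).shape)
```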
This list is automatically generated from the titles and abstracts of the papers on this site.