Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment
- URL: http://arxiv.org/abs/2308.15854v3
- Date: Sat, 04 Jan 2025 08:24:37 GMT
- Title: Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment
- Authors: Zhanbo Feng, Zenan Ling, Xinyu Lu, Ci Gong, Feng Zhou, Wugedele Bao, Jie Li, Fan Yang, Robert C. Qiu
- Abstract summary: We propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a frozen pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt.
- Score: 10.82748329166797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of denoising diffusion models is becoming increasingly popular in the field of image editing. However, current approaches often rely on either image-guided methods, which provide a visual reference but lack control over semantic consistency, or text-guided methods, which ensure alignment with the text guidance but compromise visual quality. To resolve this issue, we propose a framework that integrates a fusion of generated visual references and text guidance into the semantic latent space of a \textit{frozen} pre-trained diffusion model. Using only a tiny neural network, our framework provides control over diverse content and attributes, driven intuitively by the text prompt. Compared to state-of-the-art methods, the framework generates images of higher quality while providing realistic editing effects across various benchmark datasets.
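The abstract only gestures at the mechanism, so here is a minimal, framework-agnostic sketch of what such step-wise fusion could look like: a tiny network maps the text and visual-reference embeddings to a per-step shift injected into the bottleneck features of a frozen denoiser. All names (`FusionNet`, `frozen_unet`, the `bottleneck_shift` hook), the dimensions, and the timestep encoding are hypothetical assumptions, not the paper's actual interface.

```python
# Hedged sketch: step-wise fusion of text + visual guidance into the semantic
# (bottleneck) features of a frozen diffusion denoiser. All names here are
# hypothetical placeholders, not the paper's actual code.
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Tiny network mapping concatenated text/visual embeddings (plus the
    timestep) to a shift for the frozen denoiser's bottleneck features."""
    def __init__(self, text_dim=768, vis_dim=768, h_dim=1280):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + vis_dim + 1, 512),  # +1 for the timestep scalar
            nn.SiLU(),
            nn.Linear(512, h_dim),
        )

    def forward(self, text_emb, vis_emb, t):
        t_feat = t.float().view(-1, 1) / 1000.0             # crude timestep encoding
        fused = torch.cat([text_emb, vis_emb, t_feat], dim=-1)
        return self.mlp(fused)                               # (B, h_dim) semantic shift

def edit_step(frozen_unet, fusion_net, x_t, t, text_emb, vis_emb):
    """One denoising step: the frozen UNet predicts noise while the tiny
    FusionNet injects a semantic shift into its bottleneck activations.
    `frozen_unet` and its `bottleneck_shift` hook are assumed interfaces."""
    delta_h = fusion_net(text_emb, vis_emb, t)
    return frozen_unet(x_t, t, text_emb, bottleneck_shift=delta_h)
```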
Related papers
- Textualize Visual Prompt for Image Editing via Diffusion Bridge [15.696208035498753]
Current visual prompt methods rely on a pretrained text-guided image-to-image generative model.
We present a framework based on any single text-to-image model, without relying on an explicit image-to-image model.
arXiv Detail & Related papers (2025-01-07T03:33:22Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
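As a rough illustration of this kind of prompt learning (not the paper's actual procedure), the sketch below optimises a handful of continuous prompt tokens so that a frozen generator's output agrees more closely with the text; `generate_fn`, `image_emb_fn`, and the loss are placeholder assumptions.

```python
# Hedged sketch of continuous prompt learning: learnable prompt tokens are
# prepended to the frozen text embeddings and optimised so that the generated
# image matches the text more closely. Callables are placeholders.
import torch
import torch.nn.functional as F

def learn_prompt(text_emb, generate_fn, image_emb_fn,
                 n_tokens=4, dim=768, steps=100, lr=1e-2):
    """text_emb: (L, dim) frozen token embeddings of the user prompt.
    generate_fn(prompt_emb) -> image tensor (assumed differentiable);
    image_emb_fn(img) -> (dim,) image embedding in the same space."""
    learned = torch.randn(n_tokens, dim, requires_grad=True)
    opt = torch.optim.Adam([learned], lr=lr)
    for _ in range(steps):
        prompt_emb = torch.cat([learned, text_emb], dim=0)   # prepend learned tokens
        img = generate_fn(prompt_emb)
        sim = F.cosine_similarity(image_emb_fn(img), text_emb.mean(0), dim=0)
        loss = -sim                                          # maximise text-image match
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.cat([learned.detach(), text_emb], dim=0)
```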
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors [105.37795139586075]
We propose a new task for "stylizing" text-to-image models, namely text-driven stylized image generation.
We present a new diffusion model (ControlStyle) via upgrading a pre-trained text-to-image model with a trainable modulation network.
Experiments demonstrate the effectiveness of our ControlStyle in producing more visually pleasing and artistic results.
arXiv Detail & Related papers (2023-11-09T15:50:52Z)
- Dense Text-to-Image Generation with Attention Modulation [49.287458275920514]
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions.
We propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions.
We achieve visual results of similar quality to models specifically trained with layout conditions.
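A hedged sketch of the general idea of training-free attention modulation with layout masks follows; the boost rule, tensor shapes, and names are illustrative assumptions rather than DenseDiffusion's exact formulation.

```python
# Hedged sketch: boost cross-attention logits between each caption segment's
# tokens and the image positions inside that segment's layout mask.
# Shapes and the additive boost are illustrative assumptions.
import torch

def modulate_attention(attn_logits, token_masks, region_masks, weight=2.0):
    """attn_logits: (heads, n_pixels, n_tokens) cross-attention logits.
    token_masks:  (n_segments, n_tokens) bools - tokens of each segment.
    region_masks: (n_segments, n_pixels) bools - where the segment should appear."""
    mod = attn_logits.clone()
    for tok_m, reg_m in zip(token_masks, region_masks):
        boost = weight * reg_m.float().unsqueeze(-1) * tok_m.float()  # (n_pixels, n_tokens)
        mod = mod + boost.unsqueeze(0)                                # broadcast over heads
    return mod
```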
arXiv Detail & Related papers (2023-08-24T17:59:01Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD).
In our experiments, we apply PhD to subject-driven image editing tasks and also explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z)
- RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment [112.45442468794658]
We propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff.
In the coarse semantic re-alignment phase, a novel caption reward is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt.
The fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images from a local semantic view.
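A minimal sketch of what a caption-based reward could look like in code is given below; `caption_fn` and `text_emb_fn` stand in for any off-the-shelf captioner and text encoder and are assumptions, not RealignDiff's actual modules.

```python
# Hedged sketch of a caption reward: caption the generated image with a frozen
# captioner and score how well that caption matches the original prompt.
import torch.nn.functional as F

def caption_reward(image, prompt, caption_fn, text_emb_fn):
    """caption_fn(image) -> str; text_emb_fn(str) -> (dim,) embedding."""
    caption = caption_fn(image)                          # e.g. a BLIP-style captioner
    e_cap, e_prompt = text_emb_fn(caption), text_emb_fn(prompt)
    return F.cosine_similarity(e_cap, e_prompt, dim=0)   # higher = better aligned
```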
arXiv Detail & Related papers (2023-05-31T06:59:21Z)
- ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models [77.03361270726944]
Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models.
We propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information.
We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout.
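To make the step-wise idea concrete, here is a hedged sketch in which different prompt embeddings are assigned to different ranges of denoising steps (early steps roughly shape layout, late steps texture); the stage split and the denoiser/scheduler signatures are assumptions, not ProSpect's implementation.

```python
# Hedged sketch: a "prompt spectrum" assigns a different prompt embedding to
# each range of denoising steps. The denoiser and scheduler signatures are
# assumed placeholders.
def prompt_for_step(t, total_steps, stage_prompts):
    """stage_prompts: prompt embeddings ordered from the earliest (noisiest,
    layout-level) stage to the latest (detail-level) stage."""
    n = len(stage_prompts)
    stage = min(n - 1, int((total_steps - t) / total_steps * n))
    return stage_prompts[stage]

def sample(frozen_unet, scheduler_step, x_t, stage_prompts, total_steps=1000):
    for t in reversed(range(total_steps)):
        cond = prompt_for_step(t, total_steps, stage_prompts)
        eps = frozen_unet(x_t, t, cond)      # frozen denoiser, assumed signature
        x_t = scheduler_step(x_t, eps, t)    # assumed reverse-diffusion update
    return x_t
```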
arXiv Detail & Related papers (2023-05-25T16:32:01Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models [0.0]
We propose an optimization-free and zero fine-tuning framework that applies complex and non-rigid edits to a single real image via a text prompt.
We prove our method's efficacy in producing high-quality, diverse, semantically coherent, and faithful real image edits.
arXiv Detail & Related papers (2022-11-15T01:07:38Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited.
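The mask-generation idea can be sketched roughly as follows: noise the image, compare the denoiser's noise predictions under the reference and edit prompts, and threshold the averaged difference. The noise level, normalisation, and function signatures below are assumptions, not DiffEdit's exact recipe.

```python
# Hedged sketch of contrast-based mask estimation: regions where the noise
# predictions under the two prompts disagree most are the ones to edit.
import torch

def estimate_edit_mask(frozen_unet, add_noise_fn, x0, src_emb, edit_emb,
                       t=600, n_samples=8, threshold=0.5):
    """x0: (B, C, H, W) input image/latent; add_noise_fn(x0, noise, t) -> x_t."""
    diffs = []
    for _ in range(n_samples):
        noise = torch.randn_like(x0)
        x_t = add_noise_fn(x0, noise, t)              # forward diffusion, assumed helper
        eps_src = frozen_unet(x_t, t, src_emb)        # assumed denoiser signature
        eps_edit = frozen_unet(x_t, t, edit_emb)
        diffs.append((eps_edit - eps_src).abs().mean(dim=1, keepdim=True))
    d = torch.stack(diffs).mean(dim=0)                # average over noise samples
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)    # normalise to [0, 1]
    return (d > threshold).float()                    # binary edit mask
```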
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
- Prompt-to-Prompt Image Editing with Cross Attention Control [41.26939787978142]
We present an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only.
We show our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
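A very small, hedged sketch of the attention-injection idea: for an early fraction of the denoising steps, the cross-attention maps recorded from the source prompt replace those of the edited prompt so the layout is preserved. The schedule and tensor shapes are assumptions.

```python
# Hedged sketch: re-use source-prompt cross-attention maps for the first part
# of the edited generation to preserve the original layout.
def inject_attention(attn_src, attn_edit, t, total_steps, inject_ratio=0.8):
    """attn_*: (heads, n_pixels, n_tokens) softmaxed cross-attention maps,
    with matching token counts (an assumption for this sketch)."""
    steps_done = total_steps - t
    if steps_done < inject_ratio * total_steps:
        return attn_src      # early steps: keep the source layout
    return attn_edit         # late steps: let the edited prompt take over
```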
arXiv Detail & Related papers (2022-08-02T17:55:41Z)
- More Control for Free! Image Synthesis with Semantic Diffusion Guidance [79.88929906247695]
Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from an example image.
We introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis.
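The guidance mechanism can be sketched in a few lines of classifier-guidance-style code: the gradient of a text or image similarity with respect to the current sample nudges each denoising update. `sim_fn` and `denoise_step` are placeholder callables, not the paper's API.

```python
# Hedged sketch of semantic guidance: add the gradient of a similarity score
# (language and/or image guidance) to the unguided denoising update.
import torch

def guided_step(denoise_step, sim_fn, x_t, t, guidance_scale=100.0):
    """sim_fn(x_t, t) -> scalar score; denoise_step(x_t, t) -> x_{t-1}."""
    x_t = x_t.detach().requires_grad_(True)
    score = sim_fn(x_t, t)                       # higher = closer to the guidance
    grad = torch.autograd.grad(score, x_t)[0]
    x_prev = denoise_step(x_t, t)                # unguided reverse step (assumed)
    return x_prev + guidance_scale * grad        # push the sample toward the guidance
```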
arXiv Detail & Related papers (2021-12-10T18:55:50Z)
- Blended Diffusion for Text-driven Editing of Natural Images [18.664733153082146]
We introduce the first solution for performing local (region-based) edits in generic natural images.
We achieve our goal by leveraging a pretrained language-image model (CLIP) in combination with a denoising diffusion model.
To seamlessly fuse the edited region with the unchanged parts of the image, we spatially blend noised versions of the input image with the local text-guided diffusion latent.
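The blending step itself is simple enough to sketch: inside the user mask the edited latent is kept, outside it the sample is overwritten with a freshly noised copy of the original image at the same noise level. `add_noise_fn` is an assumed forward-diffusion helper, not the paper's exact code.

```python
# Hedged sketch of spatial blending at one timestep: edited content inside the
# mask, noised original content outside it.
import torch

def blend(x_t_edited, x0_original, mask, t, add_noise_fn):
    """mask: 1 inside the region being edited, 0 elsewhere (same spatial size)."""
    noise = torch.randn_like(x0_original)
    x_t_bg = add_noise_fn(x0_original, noise, t)       # q(x_t | x_0), assumed helper
    return mask * x_t_edited + (1.0 - mask) * x_t_bg
```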
arXiv Detail & Related papers (2021-11-29T18:58:49Z)
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [71.1862388442953]
We develop a text-based interface for StyleGAN image manipulation.
We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt.
Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation.
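The optimisation scheme mentioned first can be sketched roughly as below: a latent is optimised to raise a CLIP-style similarity with the prompt while staying close to its starting point. `generator`, `clip_sim`, and the regulariser weight are assumptions, not StyleCLIP's exact losses.

```python
# Hedged sketch of text-driven latent optimisation with a CLIP-style loss.
import torch

def edit_latent(generator, clip_sim, w_init, text,
                steps=200, lr=0.05, id_weight=0.1):
    """generator(w) -> image; clip_sim(image, text) -> scalar similarity."""
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = generator(w)                                   # frozen generator
        loss = -clip_sim(img, text) + id_weight * (w - w_init).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```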
arXiv Detail & Related papers (2021-03-31T17:51:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.