SEGA: Instructing Text-to-Image Models using Semantic Guidance
- URL: http://arxiv.org/abs/2301.12247v2
- Date: Thu, 2 Nov 2023 18:17:01 GMT
- Title: SEGA: Instructing Text-to-Image Models using Semantic Guidance
- Authors: Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek,
Patrick Schramowski, Kristian Kersting
- Abstract summary: We show how to interact with the diffusion process to flexibly steer it along semantic directions.
SEGA generalizes to any generative architecture using classifier-free guidance.
It allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception.
- Score: 33.080261792998826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image diffusion models have recently received a lot of interest for
their astonishing ability to produce high-fidelity images from text only.
However, achieving one-shot generation that aligns with the user's intent is
nearly impossible, yet small changes to the input prompt often result in very
different images. This leaves the user with little semantic control. To put the
user in control, we show how to interact with the diffusion process to flexibly
steer it along semantic directions. This semantic guidance (SEGA) generalizes
to any generative architecture using classifier-free guidance. More
importantly, it allows for subtle and extensive edits, changes in composition
and style, as well as optimizing the overall artistic conception. We
demonstrate SEGA's effectiveness on both latent and pixel-based diffusion
models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of
tasks, thus providing strong evidence for its versatility, flexibility, and
improvements over existing methods.
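For intuition, the steering mechanism can be sketched on top of standard classifier-free guidance. The display below is a schematic reconstruction, not the paper's exact formulation: the noise estimate conditioned on an edit concept c_e is compared against the unconditional estimate, only the strongest-reacting dimensions are kept, and the resulting term is added to (or subtracted from) the guided estimate. Notation is paraphrased, and the paper's momentum and warm-up terms are omitted.

```latex
% Schematic of a semantic guidance term on top of classifier-free guidance (requires amsmath).
% \epsilon_\theta(z_t)       : unconditional noise estimate
% \epsilon_\theta(z_t, c_p)  : estimate conditioned on the prompt c_p
% \epsilon_\theta(z_t, c_e)  : estimate conditioned on an edit concept c_e
\begin{align*}
  \bar{\epsilon}_\theta(z_t, c_p)
    &= \epsilon_\theta(z_t)
     + s_g \bigl( \epsilon_\theta(z_t, c_p) - \epsilon_\theta(z_t) \bigr)
     && \text{classifier-free guidance} \\
  \bar{\epsilon}_\theta(z_t, c_p, c_e)
    &= \epsilon_\theta(z_t)
     + s_g \bigl( \epsilon_\theta(z_t, c_p) - \epsilon_\theta(z_t) \bigr)
     + \gamma(z_t, c_e)
     && \text{semantic guidance term added} \\
  \gamma(z_t, c_e)
    &= \pm\, s_e \, \mu_\lambda\!\bigl( \epsilon_\theta(z_t, c_e) - \epsilon_\theta(z_t) \bigr)
     && \text{$\mu_\lambda$ keeps the top-$\lambda$ elements; the sign adds or removes $c_e$}
\end{align*}
```

Consult the paper for the exact definitions of the masking function, the per-concept scales, and the momentum scheme.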
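In practice, the method is available in Hugging Face diffusers as SemanticStableDiffusionPipeline. The sketch below assumes that pipeline, a recent diffusers release, and a CUDA-capable GPU; the checkpoint name, prompt, edit concepts, and all hyperparameter values are illustrative only, and argument names may differ across versions.

```python
import torch
from diffusers import SemanticStableDiffusionPipeline

# Load a Stable Diffusion backbone with the semantic-guidance pipeline.
pipe = SemanticStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

out = pipe(
    prompt="a photo of the face of a woman",
    guidance_scale=7.0,                       # ordinary classifier-free guidance
    generator=torch.Generator("cuda").manual_seed(21),
    # Concepts to steer along; each gets its own guidance term.
    editing_prompt=["smiling, smile", "glasses, wearing glasses"],
    reverse_editing_direction=[False, True],  # add the first concept, remove the second
    edit_guidance_scale=[5.0, 6.0],           # per-concept strength
    edit_warmup_steps=[10, 10],               # steps before a concept kicks in
    edit_threshold=[0.99, 0.95],              # keep only the strongest-reacting dimensions
    edit_momentum_scale=0.3,                  # momentum accumulated over timesteps
    edit_mom_beta=0.6,
)
out.images[0].save("sega_edit.png")
```

Because the edits act on the guidance terms rather than the prompt, the same seed and prompt keep the overall composition while the listed concepts are added or removed.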
Related papers
- AdaptiveDrag: Semantic-Driven Dragging on Diffusion-Based Image Editing [14.543341303789445]
We propose a novel mask-free point-based image editing method, AdaptiveDrag, which generates images that better align with user intent.
To ensure a comprehensive connection between the input image and the drag process, we have developed a semantic-driven optimization.
Building on these effective designs, our method delivers superior generation results using only a single input image and the handle-target point pairs.
arXiv Detail & Related papers (2024-10-16T15:59:02Z)
- Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning [40.06403155373455]
We propose a novel reinforcement learning framework for personalized text-to-image generation.
Our proposed approach outperforms existing state-of-the-art methods by a large margin on visual fidelity while maintaining text-alignment.
arXiv Detail & Related papers (2024-07-09T08:11:53Z)
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS).
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z)
- DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models [53.17454737232668]
We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts.
These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions.
We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D.
arXiv Detail & Related papers (2023-12-21T12:11:00Z)
- Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models [46.18013380882767]
This work focuses on inverting the diffusion model to obtain interpretable language prompts directly.
We leverage the findings that different timesteps of the diffusion process cater to different levels of detail in an image.
We show that our approach can identify semantically interpretable and meaningful prompts for a target image.
arXiv Detail & Related papers (2023-12-19T18:47:30Z)
- DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision.
By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images.
We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach makes text-to-image diffusion models easier to use and improves the user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
- The Stable Artist: Steering Semantics in Diffusion Latent Space [17.119616029527744]
We present the Stable Artist, an image editing approach enabling fine-grained control of the image generation process.
The main component is semantic guidance (SEGA) which steers the diffusion process along variable numbers of semantic directions.
SEGA enables probing of latent spaces to gain insights into the representation of concepts learned by the model.
arXiv Detail & Related papers (2022-12-12T16:21:24Z)
- Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models [0.0]
We propose an optimization-free and zero fine-tuning framework that applies complex and non-rigid edits to a single real image via a text prompt.
We prove our method's efficacy in producing high-quality, diverse, semantically coherent, and faithful real image edits.
arXiv Detail & Related papers (2022-11-15T01:07:38Z)
- FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869]
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms.
arXiv Detail & Related papers (2022-03-09T13:34:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site. Its accuracy and completeness are not guaranteed, and the site is not responsible for any consequences of its use.