ProSpect: Prompt Spectrum for Attribute-Aware Personalization of
Diffusion Models
- URL: http://arxiv.org/abs/2305.16225v3
- Date: Thu, 7 Dec 2023 07:56:52 GMT
- Title: ProSpect: Prompt Spectrum for Attribute-Aware Personalization of
Diffusion Models
- Authors: Yuxin Zhang, Weiming Dong, Fan Tang, Nisha Huang, Haibin Huang,
Chongyang Ma, Tong-Yee Lee, Oliver Deussen, Changsheng Xu
- Abstract summary: Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models.
We propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information.
We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout.
- Score: 77.03361270726944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Personalizing generative models offers a way to guide image generation with
user-provided references. Current personalization methods can invert an object
or concept into the textual conditioning space and compose new natural
sentences for text-to-image diffusion models. However, representing and editing
specific visual attributes such as material, style, and layout remains a
challenge, leading to a lack of disentanglement and editability. To address
this problem, we propose a novel approach that leverages the step-by-step
generation process of diffusion models, which generate images from low to high
frequency information, providing a new perspective on representing, generating,
and editing images. We develop the Prompt Spectrum Space P*, an expanded
textual conditioning space, and a new image representation method called
ProSpect. ProSpect represents an image as a collection of inverted textual
token embeddings encoded from per-stage prompts, where each prompt corresponds
to a specific generation stage (i.e., a group of consecutive steps) of the
diffusion model. Experimental results demonstrate that P* and ProSpect offer
better disentanglement and controllability compared to existing methods. We
apply ProSpect in various personalized attribute-aware image generation
applications, such as image-guided or text-driven manipulations of materials,
style, and layout, achieving previously unattainable results from a single
image input without fine-tuning the diffusion models. Our source code is
available at https://github.com/zyxElsa/ProSpect.
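To make the per-stage conditioning described in the abstract concrete, the following is a minimal Python/PyTorch sketch of the idea, not the authors' released implementation: the sampling steps are partitioned into consecutive stages and the inverted prompt embedding is swapped at each stage boundary. The denoiser interface, the embedding shapes, the ten-stage split, and the simple deterministic DDIM update are illustrative assumptions.

# Minimal sketch of per-stage prompt conditioning (illustrative, not the authors' code).
# Assumptions: `denoiser` is any frozen noise-prediction network with signature
# denoiser(x, t, cond); `stage_embeddings` holds one inverted prompt embedding per stage,
# ordered from the earliest (low-frequency / layout) stage to the latest (high-frequency
# / detail) stage; the DDIM (eta = 0) update stands in for whatever sampler is used.
import torch

def stage_index(step: int, total_steps: int, num_stages: int) -> int:
    """Map a sampling step to its generation stage (a group of consecutive steps)."""
    return min(step * num_stages // total_steps, num_stages - 1)

@torch.no_grad()
def sample_with_prompt_spectrum(denoiser, stage_embeddings, x_T, timesteps, alphas_cumprod):
    """Sampling loop that swaps the text conditioning at each stage boundary."""
    x = x_T
    num_stages, total = len(stage_embeddings), len(timesteps)
    for step, t in enumerate(timesteps):                        # timesteps run from high to low noise
        cond = stage_embeddings[stage_index(step, total, num_stages)]
        eps = denoiser(x, t, cond)                              # noise prediction under this stage's prompt
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[step + 1]] if step + 1 < total else torch.tensor(1.0)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean image
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps # deterministic DDIM step
    return x

# Toy usage with a stub denoiser standing in for the frozen text-to-image U-Net.
betas = torch.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
stages = [torch.randn(77, 768) for _ in range(10)]              # e.g. ten per-stage prompt embeddings
stub = lambda x, t, cond: torch.zeros_like(x)                   # hypothetical placeholder network
image = sample_with_prompt_spectrum(stub, stages, torch.randn(1, 4, 64, 64),
                                    list(range(999, -1, -50)), alphas_cumprod)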
Related papers
- Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image
Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods.
The proposed approach can be applied to any personalized diffusion model and requires only a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for generating text within images, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing a refinement process at the inference stage, we achieve notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in generating images that accurately reflect the prompt.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
More specifically, both the input texts and the input images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models [60.63556257324894]
A key desired property of image generative models is the ability to disentangle different attributes.
We propose a simple, light-weight image editing algorithm in which the mixing weights of two text embeddings (one for the original description and one containing the target attribute) are optimized for style matching and content preservation.
Experiments show that the proposed method can modify a wide range of attributes, outperforming diffusion-model-based image-editing algorithms.
arXiv Detail & Related papers (2022-12-16T19:58:52Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process (a rough dispatch scheme is sketched after this list).
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
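The stage-wise specialization in eDiffi is complementary to per-stage prompts: instead of changing the conditioning across stages, it changes which denoiser runs at each noise level. Below is a rough, hypothetical dispatch wrapper, not the released eDiffi code; the expert interfaces and interval boundaries are assumptions.

# Hypothetical dispatch wrapper for stage-specialized expert denoisers
# (illustrative only; the real eDiffi experts, boundaries, and interfaces differ).
import torch

class StageExpertEnsemble(torch.nn.Module):
    """Route each denoising step to the expert trained for that noise interval.

    Only one expert runs per step, so inference cost matches a single model.
    """
    def __init__(self, experts, boundaries):
        super().__init__()
        self.experts = torch.nn.ModuleList(experts)  # ordered from the high-noise to the low-noise expert
        self.boundaries = boundaries                 # descending timestep thresholds, len(experts) - 1 values

    def forward(self, x, t, cond):
        idx = sum(1 for b in self.boundaries if t < b)  # number of boundaries already crossed
        return self.experts[idx](x, t, cond)            # run only the expert for this noise level

Such a wrapper could be passed as the denoiser argument of a sampling loop like the one sketched after the abstract above, pairing stage-specific experts with stage-specific prompts.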