When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for
Personalized Image Generation
- URL: http://arxiv.org/abs/2311.17461v1
- Date: Wed, 29 Nov 2023 09:05:14 GMT
- Title: When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for
Personalized Image Generation
- Authors: Xiaoming Li, Xinyu Hou, Chen Change Loy
- Abstract summary: Text-to-image diffusion models have excelled in producing diverse, high-quality, and photo-realistic images.
We present a novel use of the extended StyleGAN embedding space $\mathcal{W}_+$ to achieve enhanced identity preservation and disentanglement for diffusion models.
Our method adeptly generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions.
- Score: 60.305112612629465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image diffusion models have remarkably excelled in producing diverse,
high-quality, and photo-realistic images. This advancement has spurred a
growing interest in incorporating specific identities into generated content.
Most current methods employ an inversion approach to embed a target visual
concept into the text embedding space using a single reference image. However,
the newly synthesized faces either closely resemble the reference image in
terms of facial attributes, such as expression, or exhibit a reduced capacity
for identity preservation. Text descriptions intended to guide the facial
attributes of the synthesized face may fall short, owing to the intricate
entanglement of identity information with identity-irrelevant facial attributes
derived from the reference image. To address these issues, we present the novel
use of the extended StyleGAN embedding space $\mathcal{W}_+$, to achieve
enhanced identity preservation and disentanglement for diffusion models. By
aligning this semantically meaningful human face latent space with
text-to-image diffusion models, we succeed in maintaining high fidelity in
identity preservation, coupled with the capacity for semantic editing.
Additionally, we propose new training objectives to balance the influences of
both prompt and identity conditions, ensuring that the identity-irrelevant
background remains unaffected during facial attribute modifications. Extensive
experiments reveal that our method adeptly generates personalized text-to-image
outputs that are not only compatible with prompt descriptions but also amenable
to common StyleGAN editing directions in diverse settings. Our source code will
be available at \url{https://github.com/csxmli2016/w-plus-adapter}.
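As a rough illustration of the adapter idea described in the abstract (not the authors' released implementation), the PyTorch-style sketch below maps a StyleGAN $\mathcal{W}_+$ code into a few extra tokens that are concatenated with the text tokens consumed by the diffusion UNet's cross-attention. All module names, token counts, and dimensions here are assumptions.

```python
# Hypothetical sketch of a W+ -> cross-attention adapter; shapes and names are assumed.
import torch
import torch.nn as nn

class WPlusAdapter(nn.Module):
    """Project a StyleGAN w+ code into extra tokens for the diffusion UNet's cross-attention."""
    def __init__(self, num_layers=18, w_dim=512, ctx_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Sequential(
            nn.Linear(num_layers * w_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, num_tokens * ctx_dim),
        )
        self.norm = nn.LayerNorm(ctx_dim)

    def forward(self, w_plus, text_tokens):
        # w_plus: (B, 18, 512) from inverting the reference face into W+.
        # text_tokens: (B, 77, 768) from the frozen CLIP text encoder.
        b = w_plus.shape[0]
        id_tokens = self.proj(w_plus.flatten(1)).view(b, self.num_tokens, -1)
        id_tokens = self.norm(id_tokens)
        # The UNet's cross-attention layers then attend over text and identity tokens jointly.
        return torch.cat([text_tokens, id_tokens], dim=1)

# Because the identity condition lives in W+, a precomputed StyleGAN editing direction
# (e.g. a "smile" direction) can be applied to the code before injection:
#   w_edited = w_plus + alpha * smile_direction   # alpha is a user-chosen strength
```

Conditioning on a face-specific latent space rather than on the text embedding space is what the abstract credits for the improved disentanglement: the prompt steers scene and attributes while the $\mathcal{W}_+$ code carries identity, and standard W+ editing directions remain usable on the generated faces.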
Related papers
- Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis [7.099258248662009]
Text-to-image (T2I) models have significantly advanced the development of artificial intelligence.
However, existing T2I-based methods often struggle to accurately reproduce the appearance of individuals from a reference image.
We leverage the pre-trained UNet from Stable Diffusion to incorporate the target face image directly into the generation process.
arXiv Detail & Related papers (2024-09-27T19:31:04Z) - FlashFace: Human Image Personalization with High-fidelity Identity Preservation [59.76645602354481]
FlashFace allows users to easily personalize their own photos by providing one or a few reference face images and a text prompt.
Our approach is distinguished from existing human photo customization methods by higher-fidelity identity preservation and better instruction following.
arXiv Detail & Related papers (2024-03-25T17:59:57Z) - Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm [31.06269858216316]
We propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization.
We introduce identity-enhanced training, incorporating an additional image cross-attention module to capture sufficient ID information.
We also introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams.
arXiv Detail & Related papers (2024-03-18T13:39:53Z) - Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation [21.739328335601716]
This paper focuses on inserting accurate and interactive ID embedding into the Stable Diffusion Model for personalized generation.
We propose a face-wise attention loss that fits the face region rather than entangling ID-unrelated information, such as face layout and background (a generic sketch of such a region-masked loss appears after this list).
Our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
arXiv Detail & Related papers (2024-01-31T11:52:33Z) - StableIdentity: Inserting Anybody into Anywhere at First Sight [57.99693188913382]
We propose StableIdentity, which allows identity-consistent recontextualization with just one face image.
We are the first to directly inject the identity learned from a single image into video/3D generation without finetuning.
arXiv Detail & Related papers (2024-01-29T09:06:15Z) - Personalized Face Inpainting with Diffusion Models by Parallel Visual
Attention [55.33017432880408]
This paper proposes the use of Parallel Visual Attention (PVA) in conjunction with diffusion models to improve inpainting results.
We train the added attention modules and identity encoder on CelebAHQ-IDI, a dataset proposed for identity-preserving face inpainting.
Experiments demonstrate that PVA attains unparalleled identity resemblance in both face inpainting and face inpainting with language guidance tasks.
arXiv Detail & Related papers (2023-12-06T15:39:03Z) - DreamIdentity: Improved Editability for Efficient Face-identity
Preserved Image Generation [69.16517915592063]
We propose a novel face-identity encoder to learn an accurate representation of human faces.
We also propose self-augmented editability learning to enhance the editability of models.
Our methods can generate identity-preserved images under different scenes at a much faster speed.
arXiv Detail & Related papers (2023-07-01T11:01:17Z) - DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven
Text-to-Image Generation [50.39533637201273]
We propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation.
By combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability.
arXiv Detail & Related papers (2023-05-05T09:08:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.