Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation
- URL: http://arxiv.org/abs/2402.00631v2
- Date: Fri, 22 Mar 2024 14:00:37 GMT
- Title: Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation
- Authors: Yang Li, Songlin Yang, Wei Wang, Jing Dong
- Abstract summary: This paper focuses on inserting an accurate and interactive ID embedding into the Stable Diffusion Model for personalized generation.
We propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background.
Our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
- Score: 21.739328335601716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advanced diffusion-based Text-to-Image (T2I) models, such as the Stable Diffusion Model, have made significant progress in generating diverse and high-quality images using text prompts alone. However, when non-famous users require personalized image generation for their identities (IDs), the T2I models fail to accurately generate their ID-related images. The main problem is that pre-trained T2I models do not learn the mapping between the new ID prompts and their corresponding visual content. The previous methods either failed to accurately fit the face region or lost the interactive generative ability with other existing concepts in T2I models. In other words, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes ("Eiffel Tower"), actions ("holding a basketball"), and facial attributes ("eyes closed"). In this paper, we focus on inserting an accurate and interactive ID embedding into the Stable Diffusion Model for semantic-fidelity personalized generation. We address this challenge from two perspectives: face-wise region fitting and semantic-fidelity token optimization. Specifically, we first visualize the attention overfit problem and propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background. This key trick significantly enhances the ID accuracy and the interactive generative ability with other existing concepts. Then, we optimize one ID representation as multiple per-stage tokens, where each token contains two disentangled features. This expansion of the textual conditioning space improves semantic-fidelity control. Extensive experiments validate that our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
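The abstract names two concrete mechanisms: a face-wise attention loss that keeps the ID token's cross-attention inside the face region, and one ID representation expanded into multiple per-stage tokens, each holding two disentangled features. The sketch below is only an illustration of how such pieces could look, assuming binary face masks and a 768-dimensional token space; the function names, the exact loss form, and the feature split are assumptions, not the authors' released implementation.

```python
import torch


def face_wise_attention_loss(id_attn, face_mask):
    """Illustrative face-wise attention loss (assumed form, not the paper's code).

    id_attn:   (B, HW) cross-attention weights of the learned ID token over the
               spatial positions of a UNet attention layer.
    face_mask: (B, HW) binary mask, 1 inside the face region.

    Penalizes attention mass falling outside the face, so the embedding fits the
    face instead of entangling layout and background information.
    """
    attn = id_attn / (id_attn.sum(dim=-1, keepdim=True) + 1e-8)  # normalize per sample
    leaked = attn * (1.0 - face_mask)                            # attention off the face
    return leaked.sum(dim=-1).mean()


class PerStageIDTokens(torch.nn.Module):
    """One identity as several per-stage tokens, each concatenated from two
    learnable halves (a hypothetical reading of 'two disentangled features')."""

    def __init__(self, num_stages: int = 4, token_dim: int = 768):
        super().__init__()
        half = token_dim // 2
        self.feat_a = torch.nn.Parameter(0.02 * torch.randn(num_stages, half))
        self.feat_b = torch.nn.Parameter(0.02 * torch.randn(num_stages, half))

    def forward(self, stage: int) -> torch.Tensor:
        # Token injected into the text conditioning at the given denoising stage.
        return torch.cat([self.feat_a[stage], self.feat_b[stage]], dim=-1)
```

In this reading, the loss would be added to the usual denoising objective during embedding optimization, and a different token from PerStageIDTokens would replace the ID placeholder at each stage of the sampling schedule.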
Related papers
- Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis [7.099258248662009]
Text-to-image (T2I) models have significantly advanced the development of artificial intelligence.
However, existing T2I-based methods often struggle to accurately reproduce the appearance of individuals from a reference image.
We leverage the pre-trained UNet from Stable Diffusion to incorporate the target face image directly into the generation process.
arXiv Detail & Related papers (2024-09-27T19:31:04Z)
- ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning [57.91881829308395]
Identity-preserving text-to-image generation (ID-T2I) has received significant attention due to its wide range of application scenarios like AI portrait and advertising.
We present ID-Aligner, a general feedback learning framework to enhance ID-T2I performance.
arXiv Detail & Related papers (2024-04-23T18:41:56Z)
- Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm [31.06269858216316]
We propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization.
We introduce identity-enhanced training, incorporating an additional image cross-attention module to capture sufficient ID information.
We also introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams (a rough sketch of one possible reading of AdaIN-mean follows this entry).
arXiv Detail & Related papers (2024-03-18T13:39:53Z)
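The Infinite-ID summary above mentions merging an identity stream and a text stream through a mixed attention module plus an "AdaIN-mean" operation. One plausible reading, sketched here purely as an assumption, is an AdaIN variant that aligns only the per-channel means of the two streams; the paper's exact definition may differ.

```python
import torch


def adain_mean(id_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """Assumed mean-only AdaIN: shift the identity-stream features so their
    per-channel mean matches that of the text stream, without rescaling the
    variance. Shapes assumed (B, N, C); illustrative only."""
    id_mean = id_feat.mean(dim=1, keepdim=True)
    text_mean = text_feat.mean(dim=1, keepdim=True)
    return id_feat - id_mean + text_mean
```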
- Arc2Face: A Foundation Model for ID-Consistent Human Faces [95.00331107591859]
Arc2Face is an identity-conditioned face foundation model.
It can generate diverse photo-realistic images with a degree of face similarity unmatched by existing models.
arXiv Detail & Related papers (2024-03-18T10:32:51Z)
- Face2Diffusion for Fast and Editable Face Personalization [33.65484538815936]
We propose Face2Diffusion (F2D) for high-editability face personalization.
The core idea behind F2D is that removing identity-irrelevant information from the training pipeline prevents the overfitting problem.
F2D consists of three novel components.
arXiv Detail & Related papers (2024-03-08T06:46:01Z)
- StableIdentity: Inserting Anybody into Anywhere at First Sight [57.99693188913382]
We propose StableIdentity, which allows identity-consistent recontextualization with just one face image.
We are the first to directly inject the identity learned from a single image into video/3D generation without finetuning.
arXiv Detail & Related papers (2024-01-29T09:06:15Z)
- Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention [55.33017432880408]
This paper proposes the use of Parallel Visual Attention (PVA) in conjunction with diffusion models to improve inpainting results.
We train the added attention modules and identity encoder on CelebAHQ-IDI, a dataset proposed for identity-preserving face inpainting.
Experiments demonstrate that PVA attains unparalleled identity resemblance in both face inpainting and language-guided face inpainting tasks (an assumed sketch of a parallel identity-attention branch follows this entry).
arXiv Detail & Related papers (2023-12-06T15:39:03Z)
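The PVA entry above describes training added attention modules and an identity encoder alongside a diffusion model. A rough, assumed sketch of such a parallel identity-attention branch (module names and the additive fusion are illustrative, not the paper's architecture):

```python
import torch


class ParallelIdentityAttention(torch.nn.Module):
    """Assumed sketch: identity tokens from an encoder are attended to in a
    branch that runs in parallel with the existing text cross-attention, and
    the two outputs are summed."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.id_attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden, text_out, id_feats):
        # hidden:   (B, N, C) UNet features acting as queries
        # text_out: (B, N, C) output of the original text cross-attention
        # id_feats: (B, M, C) tokens produced by an identity encoder
        id_out, _ = self.id_attn(hidden, id_feats, id_feats)
        return text_out + id_out  # parallel branches merged by addition (assumed)
```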
- When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for Personalized Image Generation [60.305112612629465]
Text-to-image diffusion models have excelled in producing diverse, high-quality, and photo-realistic images.
We present a novel use of the extended StyleGAN embedding space $\mathcal{W}_+$ to achieve enhanced identity preservation and disentanglement for diffusion models.
Our method adeptly generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions (a toy adapter sketch follows this entry).
arXiv Detail & Related papers (2023-11-29T09:05:14Z)
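The adapter entry above maps StyleGAN's extended latent space into the conditioning space of a diffusion model. A toy projection under that assumption (the dimensions and the MLP design are guesses for a 1024px StyleGAN and a CLIP-style 768-dimensional token space, not the paper's architecture):

```python
import torch


class WPlusAdapter(torch.nn.Module):
    """Toy adapter: project a StyleGAN w+ latent (18 x 512 for a 1024px
    generator) into a single token embedding that can be appended to a
    diffusion model's text-prompt embeddings."""

    def __init__(self, num_ws: int = 18, w_dim: int = 512, token_dim: int = 768):
        super().__init__()
        self.proj = torch.nn.Sequential(
            torch.nn.Linear(num_ws * w_dim, 1024),
            torch.nn.GELU(),
            torch.nn.Linear(1024, token_dim),
        )

    def forward(self, w_plus: torch.Tensor) -> torch.Tensor:
        # w_plus: (B, 18, 512) -> (B, token_dim)
        return self.proj(w_plus.flatten(1))
```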
- Attribute-preserving Face Dataset Anonymization via Latent Code Optimization [64.4569739006591]
We present a task-agnostic anonymization procedure that directly optimizes the images' latent representations in the latent space of a pre-trained GAN.
We demonstrate through a series of experiments that our method is capable of anonymizing the identity of the images while, crucially, better preserving the facial attributes.
arXiv Detail & Related papers (2023-03-20T17:34:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.