DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven
Text-to-Image Generation
- URL: http://arxiv.org/abs/2305.03374v4
- Date: Tue, 27 Feb 2024 02:45:34 GMT
- Title: DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven
Text-to-Image Generation
- Authors: Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou,
Wenwu Zhu
- Abstract summary: We propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation.
By combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability.
- Score: 50.39533637201273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subject-driven text-to-image generation aims to generate customized images of
the given subject based on the text descriptions, which has drawn increasing
attention. Existing methods mainly resort to finetuning a pretrained generative
model, where the identity-relevant information (e.g., the boy) and the
identity-irrelevant information (e.g., the background or the pose of the boy)
are entangled in the latent embedding space. However, the highly entangled
latent embedding may lead to the failure of subject-driven text-to-image
generation as follows: (i) the identity-irrelevant information hidden in the
entangled embedding may dominate the generation process, so that the
generated images depend heavily on the irrelevant information while ignoring
the given text descriptions; (ii) the identity-relevant information carried in
the entangled embedding cannot be appropriately preserved, resulting in
identity changes of the subject in the generated images. To tackle these problems,
we propose DisenBooth, an identity-preserving disentangled tuning framework for
subject-driven text-to-image generation. Specifically, DisenBooth finetunes the
pretrained diffusion model in the denoising process. Different from previous
works that utilize an entangled embedding to denoise each image, DisenBooth
instead utilizes disentangled embeddings to respectively preserve the subject
identity and capture the identity-irrelevant information. We further design
novel weak-denoising and contrastive-embedding auxiliary tuning objectives to
achieve the disentanglement. Extensive experiments show that our proposed
DisenBooth framework outperforms baseline models for subject-driven
text-to-image generation with the identity-preserved embedding. Additionally,
by combining the identity-preserved embedding and identity-irrelevant
embedding, DisenBooth demonstrates more generation flexibility and
controllability.
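The abstract does not spell out the exact form of the weak-denoising and contrastive-embedding objectives, so the following is only a minimal sketch of a DisenBooth-style tuning loss under stated assumptions: the two embeddings are combined by addition, the weak-denoising term reuses the denoising loss with the identity-preserved embedding alone, the contrastive term penalizes cosine similarity between the two embeddings, and `unet(x_t, t, cond)` stands in for a pretrained noise-prediction network. All names (`f_id`, `f_irr`, the loss weights) are illustrative, not the paper's implementation.

```python
import torch.nn.functional as F

def disentangled_tuning_loss(unet, x_t, t, noise, f_id, f_irr,
                             weak_weight=0.01, contrast_weight=0.001):
    """Denoising loss with disentangled conditioning plus two auxiliary terms.

    f_id:  identity-preserved embedding (e.g., from the subject text prompt).
    f_irr: identity-irrelevant embedding (e.g., from a per-image adapter).
    """
    # (i) Standard denoising loss, conditioned on both embeddings combined.
    loss_denoise = F.mse_loss(unet(x_t, t, f_id + f_irr), noise)

    # (ii) Weak denoising: the identity-preserved embedding alone should still
    # explain part of the noise, so identity information cannot be ignored.
    loss_weak = F.mse_loss(unet(x_t, t, f_id), noise)

    # (iii) Contrastive embedding term: push the identity-irrelevant embedding
    # away from the identity-preserved one to encourage disentanglement.
    cos = F.cosine_similarity(f_id.flatten(1), f_irr.flatten(1), dim=-1)
    loss_contrast = cos.abs().mean()

    return loss_denoise + weak_weight * loss_weak + contrast_weight * loss_contrast
```

Under these assumptions, gradients through such a loss would tend to push subject identity into f_id and per-image nuisance factors into f_irr, which is the disentanglement behavior the abstract describes.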
Related papers
- EZIGen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance [20.430259028981094]
Zero-shot subject-driven image generation aims to produce images that incorporate a subject from a given example image.
The challenge lies in preserving the subject's identity while aligning with the text prompt, which often requires modifying certain aspects of the subject's appearance.
Key findings include: (1) the design of the subject image encoder significantly impacts identity preservation quality, and (2) separating text and subject guidance is crucial for both text alignment and identity preservation.
arXiv Detail & Related papers (2024-09-12T14:44:45Z) - Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process.
We demonstrate that our method enables generating multiple subjects that better align with the given prompts and layouts (a minimal attention-masking sketch appears after this list).
arXiv Detail & Related papers (2024-03-25T17:52:07Z) - Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization [23.04290567321589]
A recent surge of text-to-image (T2I) models and their customization methods makes it possible to generate new images of a user-provided subject.
These zero-shot customization methods encode the image of a specified subject into a visual embedding which is then utilized alongside the textual embedding for diffusion guidance.
We propose a visual embedding that effectively harmonizes with the given textual embedding.
We also adopt the visual-only embedding and inject the subject's clear features via a self-attention swap.
arXiv Detail & Related papers (2024-03-21T06:03:51Z) - Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm [31.06269858216316]
We propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization.
We introduce identity-enhanced training, incorporating an additional image cross-attention module to capture sufficient ID information.
We also introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams.
arXiv Detail & Related papers (2024-03-18T13:39:53Z) - Decoupled Textual Embeddings for Customized Image Generation [62.98933630971543]
Customized text-to-image generation aims to learn user-specified concepts with a few images.
Existing methods usually suffer from overfitting issues and entangle the subject-unrelated information with the learned concept.
We propose the DETEX, a novel approach that learns the disentangled concept embedding for flexible customized text-to-image generation.
arXiv Detail & Related papers (2023-12-19T03:32:10Z) - PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved
Personalization [92.90392834835751]
PortraitBooth is designed for high efficiency, robust identity preservation, and expression-editable text-to-image generation.
PortraitBooth eliminates computational overhead and mitigates identity distortion.
It incorporates emotion-aware cross-attention control for diverse facial expressions in generated images.
arXiv Detail & Related papers (2023-12-11T13:03:29Z) - When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for
Personalized Image Generation [60.305112612629465]
Text-to-image diffusion models have excelled in producing diverse, high-quality, and photo-realistic images.
We present a novel use of the extended StyleGAN embedding space $\mathcal{W}_+$ to achieve enhanced identity preservation and disentanglement for diffusion models.
Our method adeptly generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions.
arXiv Detail & Related papers (2023-11-29T09:05:14Z) - HFORD: High-Fidelity and Occlusion-Robust De-identification for Face
Privacy Protection [60.63915939982923]
Face de-identification is a practical way to solve the identity protection problem.
However, existing facial de-identification methods still suffer from several problems.
We present a High-Fidelity and Occlusion-Robust De-identification (HFORD) method to deal with these issues.
arXiv Detail & Related papers (2023-11-15T08:59:02Z) - Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing
with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD).
In our experiments, we apply PhD to both subject-driven image editing tasks and explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z)