PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control
- URL: http://arxiv.org/abs/2408.05083v1
- Date: Wed, 24 Jul 2024 07:10:25 GMT
- Title: PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control
- Authors: Rishubh Parihar, Sachidanand VS, Sabariswaran Mani, Tejan Karmali, R. Venkatesh Babu
- Abstract summary: StyleGAN models learn a rich face prior and enable smooth control over fine-grained attribute editing by latent manipulation.
This work uses the disentangled $\mathcal{W}+$ space of StyleGANs to condition the T2I model.
We perform extensive experiments to validate our method for face personalization and fine-grained attribute editing.
- Score: 24.569528214869113
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, we have seen a surge of personalization methods for text-to-image (T2I) diffusion models to learn a concept using a few images. Existing approaches, when used for face personalization, struggle to achieve convincing inversion with identity preservation and rely on semantic, text-based editing of the generated face. However, more fine-grained control is desired for facial attribute editing, which is challenging to achieve solely with text prompts. In contrast, StyleGAN models learn a rich face prior and enable smooth control over fine-grained attribute editing by latent manipulation. This work uses the disentangled $\mathcal{W}+$ space of StyleGANs to condition the T2I model. This approach allows us to precisely manipulate facial attributes, such as smoothly introducing a smile, while preserving the existing coarse text-based control inherent in T2I models. To enable conditioning of the T2I model on the $\mathcal{W}+$ space, we train a latent mapper to translate latent codes from $\mathcal{W}+$ to the token embedding space of the T2I model. The proposed approach excels in the precise inversion of face images with attribute preservation and facilitates continuous control for fine-grained attribute editing. Furthermore, our approach can be readily extended to generate compositions involving multiple individuals. We perform extensive experiments to validate our method for face personalization and fine-grained attribute editing.
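For intuition, below is a minimal sketch of a latent mapper of the kind the abstract describes, translating a StyleGAN $\mathcal{W}+$ code into pseudo-token embeddings for the T2I text encoder. The $(18 \times 512)$ input, the 768-dimensional token space, the two output tokens, and the layer sizes are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch (not the authors' exact architecture): map a StyleGAN W+ code
# to a few pseudo-token embeddings that can replace a placeholder word's
# embeddings in the T2I text encoder's output.
import torch
import torch.nn as nn

class WPlusToTokenMapper(nn.Module):
    def __init__(self, num_layers=18, w_dim=512, token_dim=768, num_tokens=2):
        super().__init__()
        self.num_tokens, self.token_dim = num_tokens, token_dim
        self.net = nn.Sequential(
            nn.Linear(num_layers * w_dim, 2048),
            nn.LayerNorm(2048),
            nn.GELU(),
            nn.Linear(2048, num_tokens * token_dim),
        )

    def forward(self, w_plus):                      # w_plus: (B, 18, 512)
        tokens = self.net(w_plus.flatten(1))        # (B, num_tokens * 768)
        return tokens.view(-1, self.num_tokens, self.token_dim)

mapper = WPlusToTokenMapper()
w_plus = torch.randn(1, 18, 512)                    # e.g. from a StyleGAN inversion encoder such as e4e
identity_tokens = mapper(w_plus)                    # (1, 2, 768), spliced into the prompt embedding
```

Under this reading, continuous attribute editing amounts to shifting the $\mathcal{W}+$ code along a known StyleGAN attribute direction (e.g. a smile direction) before mapping, so the T2I conditioning changes smoothly while coarse text-based control is left untouched.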
Related papers
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
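As a rough illustration of the focused cross-attention idea in the entry above, one way to constrain an attribute token's attention to its head noun's region is sketched below; the (attribute, noun) index pairs are assumed to come from a dependency parse, and the exact gating used in the paper may differ.

```python
# Illustrative sketch only: restrict an attribute token's cross-attention map
# to the region its syntactically bound noun attends to.
import torch

def focused_cross_attention(attn, pairs, eps=1e-8):
    """
    attn  : (pixels, tokens) cross-attention probabilities for one head
    pairs : list of (attr_idx, noun_idx) token-index pairs from the prompt's syntax
    """
    attn = attn.clone()
    for attr_idx, noun_idx in pairs:
        noun_map = attn[:, noun_idx]
        gated = attn[:, attr_idx] * noun_map        # keep attribute attention only where the noun attends
        attn[:, attr_idx] = gated / (gated.sum() + eps) * attn[:, attr_idx].sum()  # preserve total mass
    return attn / attn.sum(dim=-1, keepdim=True)    # renormalize over tokens

# Toy example: prompt "a red car and a blue bike", binding "red"->"car" and "blue"->"bike"
attn = torch.softmax(torch.randn(4096, 10), dim=-1)  # 64x64 latent positions x 10 tokens
attn = focused_cross_attention(attn, pairs=[(2, 3), (6, 7)])
```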
- Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions [21.371773126590874]
We show that there exist directions in the commonly used token-level CLIP text embeddings that enable fine-grained subject-specific control of high-level attributes in text-to-image models.
We introduce one efficient optimization-free and one robust optimization-based method to identify these directions for specific attributes from contrastive text prompts.
arXiv Detail & Related papers (2024-03-25T18:00:42Z)
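A minimal, optimization-free sketch of the contrastive-prompt idea in the entry above, using Hugging Face's CLIP text encoder; the prompts, the single-pair estimate, and the per-token indexing are illustrative assumptions rather than the paper's exact recipe (in practice one would average over many prompt pairs).

```python
# Illustrative sketch: estimate a token-level "attribute direction" from a
# contrastive prompt pair, then nudge the subject token of a new prompt along it.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()

def token_embeddings(prompt):
    ids = tok(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        return ids.input_ids[0], enc(**ids).last_hidden_state[0]   # (T,), (T, 768)

def subject_index(input_ids, word="person"):
    target = tok(word, add_special_tokens=False).input_ids[0]
    return (input_ids == target).nonzero()[0].item()

ids_pos, emb_pos = token_embeddings("a photo of a smiling person")
ids_neg, emb_neg = token_embeddings("a photo of a person")
direction = emb_pos[subject_index(ids_pos)] - emb_neg[subject_index(ids_neg)]

# Apply the direction with a continuous strength alpha to the subject token of a new
# prompt; the edited sequence would then condition the diffusion model's cross-attention.
ids_new, emb_new = token_embeddings("a photo of a person riding a bike")
emb_edit = emb_new.clone()
emb_edit[subject_index(ids_new)] += 0.6 * direction
```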
- Direct Consistency Optimization for Compositional Text-to-Image Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
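One generic way to write the trade-off the entry above describes (fit the reference images while staying close to the pretrained model) is the following; the symbols and the form of the deviation penalty $D$ are placeholders, not the paper's exact objective:

$$\min_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D}_{\text{ref}},\, t,\, \epsilon}\!\left[\,\lVert \epsilon_{\theta}(x_t, t, c) - \epsilon \rVert^{2}\,\right] \;+\; \beta\, D\!\left(\epsilon_{\theta} \,\Vert\, \epsilon_{\theta_{0}}\right)$$

where $\epsilon_{\theta}$ is the fine-tuned denoiser, $\epsilon_{\theta_{0}}$ the frozen pretrained denoiser, and $\beta$ weights how strongly deviation from the prior is penalized.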
- Beyond Inserting: Learning Identity Embedding for Semantic-Fidelity Personalized Diffusion Generation [21.739328335601716]
This paper focuses on inserting an accurate and interactive ID embedding into the Stable Diffusion model for personalized generation.
We propose a face-wise attention loss to fit the face region instead of entangling ID-unrelated information, such as face layout and background.
Our results exhibit superior ID accuracy, text-based manipulation ability, and generalization compared to previous methods.
arXiv Detail & Related papers (2024-01-31T11:52:33Z)
- DisControlFace: Adding Disentangled Control to Diffusion Autoencoder for One-shot Explicit Facial Image Editing [14.537856326925178]
We focus on exploring explicit fine-grained control of generative facial image editing.
We propose a novel diffusion-based editing framework, named DisControlFace.
Our model can be trained using 2D in-the-wild portrait images without requiring 3D or video data.
arXiv Detail & Related papers (2023-12-11T08:16:55Z)
- SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing [59.3017821001455]
In this work, we propose SingleInsert, a simple and effective baseline for single-image image-to-text (I2T) inversion that learns a concept from a single source image.
With the proposed techniques, SingleInsert excels in single concept generation with high visual fidelity while allowing flexible editing.
arXiv Detail & Related papers (2023-10-12T07:40:39Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
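A schematic of the cross-attention guidance step described in the entry above, with placeholder hooks: the helper names, tensor shapes, and step size are assumptions, and a real pipeline would record the reference maps from the UNet's cross-attention layers while reconstructing the source image.

```python
# Illustrative sketch: at each denoising step, nudge the latent so the edit's
# cross-attention maps stay close to maps recorded from the original image.
import torch

def attention_guidance_step(latent, t, ref_attn, get_attn_maps, lr=0.1):
    """One guidance update on the noisy latent x_t before the usual sampler step.

    ref_attn      : cross-attention maps recorded for the source image at step t
    get_attn_maps : callable(latent, t) -> cross-attention maps under the edited prompt
    """
    latent = latent.detach().requires_grad_(True)
    attn = get_attn_maps(latent, t)
    loss = (attn - ref_attn).pow(2).mean()          # keep the source image's spatial structure
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - lr * grad).detach()

# Toy check with stand-in tensors (a real pipeline would hook the UNet's attention):
latent = torch.randn(1, 4, 64, 64)
ref = torch.rand(1, 8, 4096, 77)                    # (batch, heads, pixels, text tokens)
fake_attn = lambda z, t: torch.sigmoid(z.mean() + torch.zeros_like(ref))
latent = attention_guidance_step(latent, t=500, ref_attn=ref, get_attn_maps=fake_attn)
```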
- ManiCLIP: Multi-Attribute Face Manipulation from Text [104.30600573306991]
We present a novel multi-attribute face manipulation method based on textual descriptions.
Our method generates natural manipulated faces with minimal text-irrelevant attribute editing.
arXiv Detail & Related papers (2022-10-02T07:22:55Z)
- S2FGAN: Semantically Aware Interactive Sketch-to-Face Translation [11.724779328025589]
This paper proposes a sketch-to-image generation framework called S2FGAN.
We employ two latent spaces to control the face appearance and adjust the desired attributes of the generated face.
Our method successfully outperforms state-of-the-art methods on attribute manipulation by exploiting greater control of attribute intensity.
arXiv Detail & Related papers (2020-11-30T13:42:39Z)
- PIE: Portrait Image Embedding for Semantic Control [82.69061225574774]
We present the first approach for embedding real portrait images in the latent space of StyleGAN.
We use StyleRig, a pretrained neural network that maps the control space of a 3D morphable face model to the latent space of the GAN.
An identity preservation energy term allows spatially coherent edits while maintaining facial integrity.
arXiv Detail & Related papers (2020-09-20T17:53:51Z)
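To make the identity preservation energy term in the entry above concrete, the embedding objective is roughly of the following shape; the decomposition and the weights $\lambda$ are an illustrative sketch, not PIE's exact formulation:

$$E(w) \;=\; \lambda_{\text{rec}}\,\lVert G(w) - I \rVert^{2} \;+\; \lambda_{\text{id}}\,\bigl(1 - \cos\bigl(f(G(w)),\, f(I)\bigr)\bigr) \;+\; \lambda_{\text{rig}}\, E_{\text{rig}}(w)$$

where $G$ is the StyleGAN generator, $I$ the input portrait, $f$ a face-recognition embedding whose cosine similarity enforces identity preservation, and $E_{\text{rig}}$ ties $w$ to the 3D morphable model controls exposed by StyleRig.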