Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions
- URL: http://arxiv.org/abs/2403.17064v1
- Date: Mon, 25 Mar 2024 18:00:42 GMT
- Title: Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions
- Authors: Stefan Andreas Baumann, Felix Krause, Michael Neumayr, Nick Stracke, Vincent Tao Hu, Björn Ommer
- Abstract summary: We show that there exist directions in the commonly used token-level CLIP text embeddings that enable fine-grained subject-specific control of high-level attributes in text-to-image models.
We introduce one efficient optimization-free and one robust optimization-based method to identify these directions for specific attributes from contrastive text prompts.
- Score: 21.371773126590874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, advances in text-to-image (T2I) diffusion models have substantially elevated the quality of their generated images. However, achieving fine-grained control over attributes remains a challenge due to the limitations of natural language prompts (such as the lack of a continuous set of intermediate descriptions between "person" and "old person"). Although many methods have been introduced that augment the model or generation process to enable such control, methods that do not require a fixed reference image are limited to either enabling global fine-grained attribute expression control or coarse attribute expression control localized to specific subjects, not both simultaneously. We show that there exist directions in the commonly used token-level CLIP text embeddings that enable fine-grained subject-specific control of high-level attributes in text-to-image models. Based on this observation, we introduce one efficient optimization-free and one robust optimization-based method to identify these directions for specific attributes from contrastive text prompts. We demonstrate that these directions can be used to augment the prompt text input with fine-grained control over attributes of specific subjects in a compositional manner (control over multiple attributes of a single subject) without having to adapt the diffusion model. Project page: https://compvis.github.io/attribute-control. Code is available at https://github.com/CompVis/attribute-control.
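As a rough illustration of the optimization-free idea described in the abstract (not the authors' implementation; see the linked repository for that), the sketch below derives a per-token direction from a contrastive prompt pair using a Hugging Face CLIP text encoder and adds a scaled copy of it to the subject's token embedding. The model name, helper functions, prompts, and the sign/scale convention are illustrative assumptions.
```python
# Minimal sketch of subject-specific attribute control via a semantic direction
# in token-level CLIP text embeddings. Assumptions: model choice, prompts, and
# the convention that larger `scale` means stronger attribute expression.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

def token_embeddings(prompt):
    """Per-token CLIP text embeddings (77 x 768) and the prompt's token ids."""
    batch = tokenizer(prompt, padding="max_length",
                      max_length=tokenizer.model_max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = text_encoder(batch.input_ids).last_hidden_state[0]
    return hidden, batch.input_ids[0].tolist()

def subject_index(ids, word):
    """Position of the subject word's token within the tokenized prompt."""
    target = tokenizer(word, add_special_tokens=False).input_ids[0]
    return ids.index(target)

# Optimization-free variant of the idea: the direction is the difference of the
# subject token's embedding under two contrastive prompts.
emb_old, ids_old = token_embeddings("a photo of an old person")
emb_base, ids_base = token_embeddings("a photo of a person")
direction = (emb_old[subject_index(ids_old, "person")]
             - emb_base[subject_index(ids_base, "person")])

# At generation time, shift only the subject's token by a continuous scale.
scale = 0.6  # assumed convention: 0 = unchanged, larger = stronger attribute
prompt_emb, ids = token_embeddings("a portrait of a person walking a dog")
prompt_emb[subject_index(ids, "person")] += scale * direction
```
The edited embedding would then be passed to the diffusion model in place of the ordinary prompt embedding (e.g. through a pipeline argument that accepts precomputed prompt embeddings); sweeping the scale yields the continuous, subject-specific attribute control described above, without adapting the diffusion model itself.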
Related papers
- PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control [24.569528214869113]
StyleGAN models learn a rich face prior and enable smooth, fine-grained attribute editing through latent manipulation.
This work uses the disentangled $\mathcal{W}+$ space of StyleGANs to condition the T2I model.
We perform extensive experiments to validate our method for face personalization and fine-grained attribute editing.
arXiv Detail & Related papers (2024-07-24T07:10:25Z) - Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z) - Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models [68.47333676663312]
We show a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models.
The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in minimal tokens.
We illustrate its benefits in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.
arXiv Detail & Related papers (2024-02-21T03:01:17Z) - Air-Decoding: Attribute Distribution Reconstruction for Decoding-Time Controllable Text Generation [58.911255139171075]
Controllable text generation (CTG) aims to generate text with desired attributes.
We propose a novel lightweight decoding framework named Air-Decoding.
Our method achieves a new state-of-the-art control performance.
arXiv Detail & Related papers (2023-10-23T12:59:11Z) - Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models [82.19740045010435]
We introduce Uni-ControlNet, a unified framework that allows for the simultaneous utilization of different local controls and global controls.
Unlike existing methods, Uni-ControlNet only requires the fine-tuning of two additional adapters upon frozen pre-trained text-to-image diffusion models.
Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality and composability.
arXiv Detail & Related papers (2023-05-25T17:59:58Z) - Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models [8.250234707160793]
Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts.
However, they often fail to semantically align the generated images with the prompts due to their limited compositional capabilities.
We propose a novel attention mask control strategy based on predicted object boxes to address these issues.
arXiv Detail & Related papers (2023-05-23T10:49:22Z) - DisCup: Discriminator Cooperative Unlikelihood Prompt-tuning for Controllable Text Generation [6.844825905212349]
We propose a new CTG approach, namely DisCup, which incorporates the attribute knowledge of a discriminator to optimize the control-prompts.
DisCup achieves new state-of-the-art control performance while maintaining efficient, high-quality text generation, relying on only around 10 virtual tokens.
arXiv Detail & Related papers (2022-10-18T02:59:06Z) - Attribute-specific Control Units in StyleGAN for Fine-grained Image Manipulation [57.99007520795998]
We discover attribute-specific control units, which consist of multiple channels of feature maps and modulation styles.
Specifically, we collaboratively manipulate the modulation style channels and feature maps in control units to obtain the semantic and spatial disentangled controls.
To manipulate these control units, we move the modulation style along a specific sparse direction vector and replace the filter-wise styles used to compute the feature maps.
arXiv Detail & Related papers (2021-11-25T10:42:10Z) - Controllable Dialogue Generation with Disentangled Multi-grained Style Specification and Attribute Consistency Reward [47.96949534259019]
We propose a controllable dialogue generation model to steer response generation under multi-attribute constraints.
We categorize the commonly used control attributes into global and local ones, which possess different granularities of effects on response generation.
Our model can significantly outperform competitive baselines in terms of response quality, content diversity and controllability.
arXiv Detail & Related papers (2021-09-14T14:29:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.