Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models
- URL: http://arxiv.org/abs/2402.17910v1
- Date: Tue, 27 Feb 2024 21:51:32 GMT
- Title: Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models
- Authors: Ashkan Taghipour, Morteza Ghahremani, Mohammed Bennamoun, Aref Miri Rekavandi, Hamid Laga, and Farid Boussaid
- Abstract summary: Box-it-to-Bind-it (B2B) is a training-free approach for improving spatial control and semantic accuracy in text-to-image (T2I) diffusion models.
B2B targets three key challenges in T2I: catastrophic neglect, attribute binding, and layout guidance.
B2B is designed as a compatible plug-and-play module for existing T2I models.
- Score: 28.278822620442774
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While latent diffusion models (LDMs) excel at creating imaginative images,
they often lack precision in semantic fidelity and spatial control over where
objects are generated. To address these deficiencies, we introduce the
Box-it-to-Bind-it (B2B) module - a novel, training-free approach for improving
spatial control and semantic accuracy in text-to-image (T2I) diffusion models.
B2B targets three key challenges in T2I: catastrophic neglect, attribute
binding, and layout guidance. The process encompasses two main steps: (i) object
generation, which adjusts the latent encoding to ensure the prompted objects are
generated and placed within their specified bounding boxes, and (ii) attribute
binding, which ensures that generated objects adhere to the attributes specified
in the prompt. B2B is designed as a compatible plug-and-play module for existing T2I
models, markedly enhancing model performance in addressing the key challenges.
We evaluate our technique using the established CompBench and TIFA score
benchmarks, demonstrating significant performance improvements compared to
existing methods. The source code will be made publicly available at
https://github.com/nextaistudio/BoxIt2BindIt.
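As a rough illustration of the box-guided latent update described in step (i), the Python sketch below scores how much of each object token's cross-attention mass falls inside its bounding box and nudges the latents along the negative gradient of that score. This is a minimal sketch under assumed interfaces: it presumes a Stable-Diffusion-style pipeline whose per-token cross-attention maps can be read out differentiably, and the function names, loss form, and toy attention stand-in are illustrative choices rather than the authors' released implementation (see the repository linked above for that); the attribute-binding step (ii) is omitted.

```python
import torch

def box_guided_step(latents, attn_fn, boxes, step_size=0.1):
    """Illustrative, training-free layout-guidance step (not the released B2B code).

    latents : (1, C, H, W) diffusion latents at the current denoising timestep
    attn_fn : callable that runs a UNet forward pass on `latents` and returns
              per-token cross-attention maps of shape (num_tokens, H, W)
    boxes   : {token_index: (x0, y0, x1, y1)} boxes in latent-grid coordinates
    """
    latents = latents.detach().requires_grad_(True)
    attn = attn_fn(latents)  # must be differentiable with respect to `latents`
    loss = latents.new_zeros(())
    for tok, (x0, y0, x1, y1) in boxes.items():
        a = attn[tok]
        mask = torch.zeros_like(a)
        mask[y0:y1, x0:x1] = 1.0
        # Fraction of this token's attention mass inside its box; maximizing it
        # both forces the object to appear (countering catastrophic neglect) and
        # pins where it appears (layout guidance).
        inside = (a * mask).sum() / (a.sum() + 1e-8)
        loss = loss + (1.0 - inside)
    (grad,) = torch.autograd.grad(loss, latents)
    return (latents - step_size * grad).detach()

# Toy usage with a stand-in attention function; a real pipeline would instead
# hook the UNet's cross-attention layers at each sampling step.
toy_attn = lambda z: torch.softmax(z.mean(1).flatten(1), dim=-1).view(1, 64, 64).expand(4, -1, -1)
z = torch.randn(1, 4, 64, 64)
z = box_guided_step(z, toy_attn, {2: (8, 8, 32, 32)})
```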
Related papers
- Boundary Attention Constrained Zero-Shot Layout-To-Image Generation [47.435234391588494]
Recent text-to-image diffusion models excel at generating high-resolution images from text but struggle with precise control over spatial composition and object counting.
We propose a novel zero-shot L2I approach, BACON, which eliminates the need for additional modules or fine-tuning.
We leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features.
arXiv Detail & Related papers (2024-11-15T05:44:45Z)
- Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis [98.21700880115938]
Text-to-image (T2I) models often fail to accurately bind semantically related objects or attributes in the input prompts.
We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token.
arXiv Detail & Related papers (2024-11-11T17:05:15Z)
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
- Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions [21.371773126590874]
We show that there exist directions in the commonly used token-level CLIP text embeddings that enable fine-grained subject-specific control of high-level attributes in text-to-image models.
We introduce one efficient optimization-free and one robust optimization-based method to identify these directions for specific attributes from contrastive text prompts.
arXiv Detail & Related papers (2024-03-25T18:00:42Z)
- Direct Consistency Optimization for Compositional Text-to-Image Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
- InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models [43.62338454684645]
We study the problems of conditioning T2I diffusion models with Human-Object Interaction (HOI) information.
We propose a pluggable interaction control model, called InteractDiffusion, that extends existing pre-trained T2I diffusion models.
Our model enables control over both the interactions and their locations in existing T2I diffusion models.
arXiv Detail & Related papers (2023-12-10T10:35:16Z)
- Context-Aware Layout to Image Generation with Enhanced Object Appearance [123.62597976732948]
A layout-to-image (L2I) generation model aims to generate a complicated image containing multiple objects (things) against a natural background (stuff).
Existing L2I models have made great progress, but object-to-object and object-to-stuff relations are often broken.
We argue that these are caused by the lack of context-aware object and stuff feature encoding in their generators, and location-sensitive appearance representation in their discriminators.
arXiv Detail & Related papers (2021-03-22T14:43:25Z)