Training-Free Structured Diffusion Guidance for Compositional
Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2212.05032v1
- Date: Fri, 9 Dec 2022 18:30:24 GMT
- Title: Training-Free Structured Diffusion Guidance for Compositional
Text-to-Image Synthesis
- Authors: Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula,
Pradyumna Narayana, Sugato Basu, Xin Eric Wang, William Yang Wang
- Abstract summary: Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks.
We improve the compositional skills of T2I models, specifically more accurate attribute binding and better image compositions.
- Score: 78.28620571530706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale diffusion models have achieved state-of-the-art results on
text-to-image synthesis (T2I) tasks. Despite their ability to generate
high-quality yet creative images, we observe that attribute binding and
compositional capabilities remain major challenges,
especially when involving multiple objects. In this work, we improve the
compositional skills of T2I models, specifically more accurate attribute
binding and better image compositions. To do this, we incorporate linguistic
structures with the diffusion guidance process based on the controllable
properties of manipulating cross-attention layers in diffusion-based T2I
models. We observe that keys and values in cross-attention layers have strong
semantic meanings associated with object layouts and content. Therefore, we can
better preserve the compositional semantics in the generated image by
manipulating the cross-attention representations based on linguistic insights.
Built upon Stable Diffusion, a SOTA T2I model, our structured cross-attention
design is efficient and requires no additional training samples. We achieve
better compositional skills in qualitative and quantitative results, leading to
a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an
in-depth analysis to reveal potential causes of incorrect image compositions
and justify the properties of cross-attention layers in the generation process.
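As a concrete illustration of the mechanism described above, the sketch below (not the authors' released code) shows a cross-attention step in which the attention map is computed once from the full-prompt keys, while the values come from several separately encoded noun-phrase segments whose outputs are averaged. The function name, tensor shapes, and the simple averaging rule are illustrative assumptions.

```python
# Minimal sketch of structured cross-attention: the layout (attention map) is
# driven by the full prompt, while content (values) is drawn from separately
# encoded noun-phrase segments. Shapes and the averaging rule are assumptions.
import torch

def structured_cross_attention(q, k_full, v_full, v_segments):
    """
    q:          (batch, n_pixels, d)  image-side queries
    k_full:     (batch, n_tokens, d)  keys from the full-prompt text encoding
    v_full:     (batch, n_tokens, d)  values from the full-prompt text encoding
    v_segments: list of (batch, n_tokens, d) values from separately encoded
                noun-phrase prompts, padded to the same token length
    """
    d = q.shape[-1]
    # One attention map from the full prompt, so the overall layout is shared.
    attn = torch.softmax(q @ k_full.transpose(-2, -1) / d ** 0.5, dim=-1)
    # Apply the same map to each value set and average, so each noun phrase
    # contributes its own attribute-bound content to the output.
    outs = [attn @ v for v in [v_full, *v_segments]]
    return torch.stack(outs, dim=0).mean(dim=0)

# Toy usage with random tensors standing in for real encodings.
q = torch.randn(1, 64, 320)
k = v = torch.randn(1, 77, 320)
v_np = [torch.randn(1, 77, 320) for _ in range(2)]
print(structured_cross_attention(q, k, v, v_np).shape)  # torch.Size([1, 64, 320])
```

In a full pipeline, a substitution of this kind would be applied inside the cross-attention layers of the denoising network at sampling time, which is what keeps the approach training-free.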
Related papers
- Progressive Compositionality In Text-to-Image Generative Models [33.18510121342558]
We propose EvoGen, a new curriculum for contrastive learning of diffusion models.
In this work, we leverage large language models (LLMs) to compose realistic, complex scenarios.
We also harness Visual-Question Answering (VQA) systems alongside diffusion models to automatically curate a contrastive dataset, ConPair.
arXiv Detail & Related papers (2024-10-22T05:59:29Z)
- CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization [27.114395240088562]
We argue that an ideal subject representation can be achieved from a cross-differential perspective, i.e., decoupling subject-intrinsic attributes from irrelevant attributes via contrastive learning.
Specifically, we propose CustomContrast, a novel framework that includes a Multilevel Contrastive Learning paradigm and Multimodal Feature Injection (MFI).
Extensive experiments show the effectiveness of CustomContrast in subject similarity and text controllability.
arXiv Detail & Related papers (2024-09-09T13:39:47Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models [58.46926334842161]
This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps.
We propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores (an illustrative sketch of such objectives appears after this list).
Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability.
arXiv Detail & Related papers (2023-12-10T22:07:42Z)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify cross-modal generative and discriminative pretraining into a single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
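For the Separate-and-Enhance entry above, the following rough sketch illustrates the kind of attention-based objectives its summary describes: an overlap penalty between per-object cross-attention maps and a term rewarding high peak attention. The exact formulation in that paper may differ; the names, shapes, and normalization here are assumptions.

```python
# Illustrative only: objectives in the spirit of the Separate and Enhance losses
# summarized above (reduce overlap between per-object attention maps, raise each
# object's peak activation). Not the paper's exact definitions.
import torch

def separate_and_enhance_losses(attn_a, attn_b):
    """attn_a, attn_b: (n_pixels,) cross-attention maps for two object tokens."""
    attn_a = attn_a / attn_a.sum()  # normalize each map over pixels
    attn_b = attn_b / attn_b.sum()
    separate = (attn_a * attn_b).sum()        # spatial overlap penalty
    enhance = -(attn_a.max() + attn_b.max())  # reward strong peak attention
    return separate, enhance

# Toy usage with random attention maps on a 64x64 latent grid.
a, b = torch.rand(64 * 64), torch.rand(64 * 64)
sep, enh = separate_and_enhance_losses(a, b)
print(float(sep), float(enh))
```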
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.