Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance
- URL: http://arxiv.org/abs/2403.16954v2
- Date: Fri, 17 Jan 2025 11:42:58 GMT
- Title: Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance
- Authors: Jingyuan Zhu, Huimin Ma, Jiansheng Chen, Jian Yuan
- Abstract summary: This paper presents a general approach for text-to-image diffusion models to address the mutual interference between different subjects and their attachments in complex scenes. We propose to bind each attachment to corresponding subjects separately with split text prompts. We then isolate and resynthesize each subject individually with corresponding text prompts to avoid mutual interference.
- Score: 19.221431052643222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale text-to-image diffusion models have achieved great success in synthesizing high-quality and diverse images given target text prompts. Despite this revolutionary image generation ability, current state-of-the-art models still struggle to handle multi-concept generation accurately in many cases. This phenomenon is known as "concept bleeding" and manifests as unexpected overlapping or merging of different concepts. This paper presents a general approach for text-to-image diffusion models to address the mutual interference between different subjects and their attachments in complex scenes, pursuing better text-image consistency. The core idea is to isolate the synthesis processes of different concepts. We propose to bind each attachment to its corresponding subject separately with split text prompts. In addition, we introduce a revision method to fix the concept bleeding problem in multi-subject synthesis: we first rely on pre-trained object detection and segmentation models to obtain the layouts of subjects, then isolate and resynthesize each subject individually with its corresponding text prompt to avoid mutual interference. Overall, we obtain a training-free strategy, named Isolated Diffusion, to optimize multi-concept text-to-image synthesis. It is compatible with the latest Stable Diffusion XL (SDXL) and prior Stable Diffusion (SD) models. We compare our approach with alternative methods on a variety of multi-concept text prompts and demonstrate its effectiveness, with clear advantages in text-image consistency and a user study.
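To make the described pipeline concrete, below is a minimal sketch (not the authors' code) of the revision step using Hugging Face diffusers: a composite image is synthesized from the full prompt, pretrained detection and segmentation provide per-subject masks, and each subject is resynthesized under its own split prompt. Inpainting stands in for the paper's resynthesis procedure, and `get_subject_masks`, the prompts, and the checkpoints are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline, StableDiffusionPipeline

def get_subject_masks(image, subject_phrases):
    """Hypothetical hook: run any pretrained open-vocabulary detector plus a
    segmentation model and return one binary PIL mask per subject phrase."""
    raise NotImplementedError("plug in your own detection/segmentation stack")

device = "cuda"
base = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to(device)
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to(device)

full_prompt = "a cat wearing sunglasses and a dog wearing a red scarf"
split_prompts = ["a cat wearing sunglasses", "a dog wearing a red scarf"]

# 1. Synthesize the composite image from the full multi-concept prompt.
image = base(full_prompt).images[0]

# 2. Obtain each subject's layout with pretrained detection + segmentation.
masks = get_subject_masks(image, ["cat", "dog"])

# 3. Isolate and resynthesize each subject region under its own split
#    prompt, so the other subject's attributes cannot bleed into it.
for prompt, mask in zip(split_prompts, masks):
    image = inpaint(prompt=prompt, image=image, mask_image=mask).images[0]
```

Note that this shows only the detection-and-revision stage; per the abstract, the split prompts also bind each attachment to its subject during the initial synthesis.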
Related papers
- MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [6.4680449907623006]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multiple subjects.
The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving text control.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
- Financial Models in Generative Art: Black-Scholes-Inspired Concept Blending in Text-to-Image Diffusion [57.03116054807942]
We introduce a novel approach for concept blending in pretrained text-to-image diffusion models.
We derive a robust algorithm for concept blending that capitalizes on the Markovian dynamics of the Black-Scholes framework.
Our work shows that financially inspired techniques can enhance text-to-image concept blending in generative AI.
arXiv Detail & Related papers (2024-05-22T14:25:57Z)
- Textual Localization: Decomposing Multi-concept Images for Subject-Driven Text-to-Image Generation [5.107886283951882]
We introduce a localized text-to-image model to handle multi-concept input images.
Our method incorporates a novel cross-attention guidance to decompose multiple concepts.
Notably, our method generates cross-attention maps consistent with the target concept in the generated images.
arXiv Detail & Related papers (2024-02-15T14:19:42Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn prompts that improve the match between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else [75.6806649860538]
We consider a more ambitious goal: natural multi-concept generation using a pre-trained diffusion model.
We observe concept dominance and non-localized contribution that severely degrade multi-concept generation performance.
We design a minimal low-cost solution that overcomes the above issues by tweaking the text embeddings for more realistic multi-concept text-to-image generation.
arXiv Detail & Related papers (2023-10-11T12:05:44Z)
- The Hidden Language of Diffusion Models [70.03691458189604]
We present Conceptor, a novel method to interpret the internal representation of a textual concept by a diffusion model.
We find surprising visual connections between concepts that transcend their textual semantics.
We additionally discover concepts that rely on mixtures of exemplars, biases, renowned artistic styles, or a simultaneous fusion of multiple meanings.
arXiv Detail & Related papers (2023-06-01T17:57:08Z)
- Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach makes text-to-image diffusion models easier to use, providing a better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost (a minimal timestep-routing sketch follows this list).
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- Recurrent Affine Transformation for Text-to-image Synthesis [5.256132101498471]
Existing methods usually adaptively fuse suitable text information into the synthesis process with isolated fusion blocks.
We propose a Recurrent Affine Transformation (RAT) for Generative Adversarial Networks that connects all the fusion blocks with a recurrent neural network to model their long-term dependency (see the recurrent-modulation sketch after this list).
Being aware of matching image regions, text descriptions supervise the generator to synthesize more relevant image contents.
arXiv Detail & Related papers (2022-04-22T03:49:47Z)
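As referenced in the eDiffi entry above, here is a minimal sketch of the ensemble-of-expert-denoisers idea under stated assumptions: each expert handles one noise-level interval of the reverse process and is selected by a simple timestep threshold, so exactly one network runs per step. The class name, cutoff values, and expert call signature are illustrative, not eDiffi's actual implementation.

```python
import torch.nn as nn

class ExpertEnsembleDenoiser(nn.Module):
    """Routes each denoising step to the expert trained for that noise range,
    so only one network runs per step (single-model inference cost)."""

    def __init__(self, experts, cutoffs):
        super().__init__()
        # e.g. three experts with cutoffs [667, 333, 0] over T = 1000 steps:
        # high-noise (global layout), mid, and low-noise (fine detail).
        self.experts = nn.ModuleList(experts)
        self.cutoffs = cutoffs

    def forward(self, x_t, t, text_emb):
        # Pick the first expert whose noise interval contains timestep t.
        for expert, cutoff in zip(self.experts, self.cutoffs):
            if t >= cutoff:
                return expert(x_t, t, text_emb)
        return self.experts[-1](x_t, t, text_emb)
```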
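And for the Recurrent Affine Transformation entry, a minimal sketch assuming a GRU carries a shared hidden state across the generator's fusion blocks and predicts a per-channel affine modulation from the text code; the module name, shapes, and dimensions are illustrative, not the paper's code.

```python
import torch.nn as nn

class RecurrentAffineFusion(nn.Module):
    """One fusion block: a GRU state shared across all blocks is updated from
    the text code, then predicts a per-channel affine (scale, shift)."""

    def __init__(self, text_dim, hidden_dim, channels):
        super().__init__()
        self.gru = nn.GRUCell(text_dim, hidden_dim)
        self.to_gamma = nn.Linear(hidden_dim, channels)
        self.to_beta = nn.Linear(hidden_dim, channels)

    def forward(self, feat, text_code, h):
        h = self.gru(text_code, h)                 # carry long-term state
        gamma = self.to_gamma(h)[..., None, None]  # (B, C, 1, 1)
        beta = self.to_beta(h)[..., None, None]
        return feat * (1.0 + gamma) + beta, h      # modulate, pass state on
```

Threading the same hidden state h through every block is what couples the fusion blocks that, per the summary, prior methods treated in isolation.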