Visual Concept-driven Image Generation with Text-to-Image Diffusion Model
- URL: http://arxiv.org/abs/2402.11487v2
- Date: Wed, 17 Jul 2024 01:47:16 GMT
- Title: Visual Concept-driven Image Generation with Text-to-Image Diffusion Model
- Authors: Tanzila Rahman, Shweta Mahajan, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Leonid Sigal,
- Abstract summary: Text-to-image (TTI) models have demonstrated impressive results in generating high-resolution images of complex scenes.
Recent approaches have extended these methods with personalization techniques that allow them to integrate user-illustrated concepts.
However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains illusive.
We propose a concept-driven TTI personalization framework that addresses these core challenges.
- Score: 65.96212844602866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (TTI) diffusion models have demonstrated impressive results in generating high-resolution images of complex and imaginative scenes. Recent approaches have further extended these methods with personalization techniques that allow them to integrate user-illustrated concepts (e.g., the user him/herself) using a few sample image illustrations. However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one, or across multiple, image illustrations remains illusive. In this work, we propose a concept-driven TTI personalization framework that addresses these core challenges. We build on existing works that learn custom tokens for user-illustrated concepts, allowing those to interact with existing text tokens in the TTI model. However, importantly, to disentangle and better learn the concepts in question, we jointly learn (latent) segmentation masks that disentangle these concepts in user-provided image illustrations. We do so by introducing an Expectation Maximization (EM)-like optimization procedure where we alternate between learning the custom tokens and estimating (latent) masks encompassing corresponding concepts in user-supplied images. We obtain these masks based on cross-attention, from within the U-Net parameterized latent diffusion model and subsequent DenseCRF optimization. We illustrate that such joint alternating refinement leads to the learning of better tokens for concepts and, as a by-product, latent masks. We illustrate the benefits of the proposed approach qualitatively and quantitatively with several examples and use cases that can combine three or more entangled concepts.
Related papers
- IP-Composer: Semantic Composition of Visual Concepts [49.18472621931207]
We present IP-Composer, a training-free approach for compositional image generation.
Our method builds on IP-Adapter, which synthesizes novel images conditioned on an input image's CLIP embedding.
We extend this approach to multiple visual inputs by crafting composite embeddings, stitched from the projections of multiple input images onto concept-specific CLIP-subspaces identified through text.
arXiv Detail & Related papers (2025-02-19T18:49:31Z) - MCM: Multi-layer Concept Map for Efficient Concept Learning from Masked Images [5.09981114473162]
We propose Multi-layer Concept Map (MCM), the first work to devise an efficient concept learning method based on masked images.
In particular, we introduce an asymmetric concept learning architecture by establishing correlations between different encoder and decoder layers.
MCM significantly reduces computational costs by training on fewer than 75% of the total image patches.
arXiv Detail & Related papers (2025-02-01T01:45:49Z) - AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization [4.544788024283586]
AttenCraft is an attention-guided method for multiple concept disentanglement.
We introduce Uniform sampling and Reweighted sampling schemes to alleviate the non-synchronicity of feature acquisition from different concepts.
Our method outperforms baseline models in terms of image-alignment, and behaves comparably on text-alignment.
arXiv Detail & Related papers (2024-05-28T08:50:14Z) - FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition [49.2208591663092]
FreeCustom is a tuning-free method to generate customized images of multi-concept composition based on reference concepts.
We introduce a new multi-reference self-attention (MRSA) mechanism and a weighted mask strategy.
Our method outperforms or performs on par with other training-based methods in terms of multi-concept composition and single-concept customization.
arXiv Detail & Related papers (2024-05-22T17:53:38Z) - Textual Localization: Decomposing Multi-concept Images for
Subject-Driven Text-to-Image Generation [5.107886283951882]
We introduce a localized text-to-image model to handle multi-concept input images.
Our method incorporates a novel cross-attention guidance to decompose multiple concepts.
Notably, our method generates cross-attention maps consistent with the target concept in the generated images.
arXiv Detail & Related papers (2024-02-15T14:19:42Z) - Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image
Models [59.094601993993535]
Text-to-image (T2I) personalization allows users to combine their own visual concepts in natural language prompts.
Most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts.
We propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts.
arXiv Detail & Related papers (2023-07-13T17:46:42Z) - Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z) - Multi-Concept Customization of Text-to-Image Diffusion [51.8642043743222]
We propose Custom Diffusion, an efficient method for augmenting existing text-to-image models.
We find that only optimizing a few parameters in the text-to-image conditioning mechanism is sufficiently powerful to represent new concepts.
Our model generates variations of multiple new concepts and seamlessly composes them with existing concepts in novel settings.
arXiv Detail & Related papers (2022-12-08T18:57:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.