Textual Localization: Decomposing Multi-concept Images for
Subject-Driven Text-to-Image Generation
- URL: http://arxiv.org/abs/2402.09966v1
- Date: Thu, 15 Feb 2024 14:19:42 GMT
- Title: Textual Localization: Decomposing Multi-concept Images for
Subject-Driven Text-to-Image Generation
- Authors: Junjie Shentu, Matthew Watson, Noura Al Moubayed
- Abstract summary: We introduce a localized text-to-image model to handle multi-concept input images.
Our method incorporates a novel cross-attention guidance mechanism to decompose multiple concepts.
Notably, our method generates cross-attention maps consistent with the target concept in the generated images.
- Score: 5.107886283951882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subject-driven text-to-image diffusion models empower users to tailor the
model to new concepts absent in the pre-training dataset using a few sample
images. However, prevalent subject-driven models primarily rely on
single-concept input images, facing challenges in specifying the target concept
when dealing with multi-concept input images. To this end, we introduce a
textual localized text-to-image model (Textual Localization) to handle
multi-concept input images. During fine-tuning, our method incorporates a novel
cross-attention guidance mechanism to decompose multiple concepts, establishing distinct
connections between the visual representation of the target concept and the
identifier token in the text prompt. Experimental results reveal that our
method outperforms or performs comparably to the baseline models in terms of
image fidelity and image-text alignment on multi-concept input images. In
comparison to Custom Diffusion, our method with hard guidance achieves CLIP-I
scores that are 7.04%, 8.13% higher and CLIP-T scores that are 2.22%, 5.85%
higher in single-concept and multi-concept generation, respectively. Notably,
our method generates cross-attention maps consistent with the target concept in
the generated images, a capability absent in existing models.
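Evaluation note: CLIP-I is the cosine similarity between CLIP image embeddings of a generated image and the reference concept images, and CLIP-T is the cosine similarity between the generated image's CLIP embedding and the text embedding of the prompt. The sketch below computes both with the Hugging Face transformers CLIP model; the checkpoint and preprocessing are assumptions, not the authors' exact evaluation setup.

```python
# Hedged sketch of CLIP-I / CLIP-T scoring (assumed checkpoint and preprocessing;
# the paper's exact evaluation pipeline may differ).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_i(generated: Image.Image, reference: Image.Image) -> float:
    # CLIP-I: image-image cosine similarity.
    inputs = processor(images=[generated, reference], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()

def clip_t(generated: Image.Image, prompt: str) -> float:
    # CLIP-T: image-text cosine similarity.
    img_in = processor(images=generated, return_tensors="pt")
    txt_in = processor(text=[prompt], return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(**img_in)
        txt = model.get_text_features(**txt_in)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img[0] @ txt[0]).item()
```

The abstract also describes cross-attention guidance that binds the identifier token's attention to the target concept during fine-tuning. The exact loss is not given here, so the following is only a generic masked-attention objective, assuming per-concept segmentation masks are available; the function name and weighting are hypothetical.

```python
import torch
import torch.nn.functional as F

def masked_attention_penalty(attn_map: torch.Tensor, concept_mask: torch.Tensor) -> torch.Tensor:
    # attn_map:     (H, W) cross-attention of the identifier token
    # concept_mask: (h, w) binary mask of the target concept (1 inside the concept)
    # Resize the mask to the attention map's latent resolution if needed.
    if concept_mask.shape != attn_map.shape:
        concept_mask = F.interpolate(
            concept_mask[None, None].float(), size=attn_map.shape, mode="nearest"
        )[0, 0]
    # Normalize attention to a distribution over spatial locations.
    attn = attn_map / (attn_map.sum() + 1e-8)
    # Penalize attention mass falling outside the concept's region, so the
    # identifier token binds to the intended concept rather than the whole image.
    return (attn * (1.0 - concept_mask)).sum()
```

Such a penalty would be added to the diffusion denoising loss during fine-tuning; a hard variant could instead zero out attention outside the mask directly.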
Related papers
- FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition [49.2208591663092]
FreeCustom is a tuning-free method for generating customized images with multi-concept composition based on reference concepts.
We introduce a new multi-reference self-attention (MRSA) mechanism and a weighted mask strategy.
Our method outperforms or performs on par with other training-based methods in terms of multi-concept composition and single-concept customization.
arXiv Detail & Related papers (2024-05-22T17:53:38Z)
- Non-confusing Generation of Customized Concepts in Diffusion Models [135.4385383284657]
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs).
Existing customized generation methods only focus on fine-tuning the second stage while overlooking the first one.
We propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning.
arXiv Detail & Related papers (2024-05-11T05:01:53Z)
- Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models [85.14042557052352]
We introduce Concept Weaver, a method for composing customized text-to-image diffusion models at inference time.
We show that Concept Weaver can generate multiple custom concepts with higher identity fidelity compared to alternative approaches.
arXiv Detail & Related papers (2024-04-05T06:41:27Z)
- Visual Concept-driven Image Generation with Text-to-Image Diffusion Model [65.96212844602866]
Text-to-image (TTI) models have demonstrated impressive results in generating high-resolution images of complex scenes.
Recent approaches have extended these methods with personalization techniques that allow them to integrate user-illustrated concepts.
However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one or across multiple image illustrations, remains elusive.
We propose a concept-driven TTI personalization framework that addresses these core challenges.
arXiv Detail & Related papers (2024-02-18T07:28:37Z)
- NEUCORE: Neural Concept Reasoning for Composed Image Retrieval [16.08214739525615]
We propose a NEUral COncept REasoning model, which incorporates multi-modal concept alignment and progressive multi-modal fusion over aligned concepts.
Our proposed approach is evaluated on three datasets and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-10-02T17:21:25Z)
- Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z)
- Designing an Encoder for Fast Personalization of Text-to-Image Models [57.62449900121022]
We propose an encoder-based domain-tuning approach for text-to-image personalization.
We employ two components: First, an encoder that takes as input a single image of a target concept from a given domain.
Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts.
arXiv Detail & Related papers (2023-02-23T18:46:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.