Decoupled Textual Embeddings for Customized Image Generation
- URL: http://arxiv.org/abs/2312.11826v1
- Date: Tue, 19 Dec 2023 03:32:10 GMT
- Title: Decoupled Textual Embeddings for Customized Image Generation
- Authors: Yufei Cai, Yuxiang Wei, Zhilong Ji, Jinfeng Bai, Hu Han and Wangmeng
Zuo
- Abstract summary: Customized text-to-image generation aims to learn user-specified concepts with a few images.
Existing methods usually suffer from overfitting issues and entangle subject-unrelated information with the learned concept.
We propose DETEX, a novel approach that learns disentangled concept embeddings for flexible customized text-to-image generation.
- Score: 62.98933630971543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Customized text-to-image generation, which aims to learn user-specified
concepts with a few images, has drawn significant attention recently. However,
existing methods usually suffer from overfitting issues and entangle the
subject-unrelated information (e.g., background and pose) with the learned
concept, limiting the potential to compose concept into new scenes. To address
these issues, we propose the DETEX, a novel approach that learns the
disentangled concept embedding for flexible customized text-to-image
generation. Unlike conventional methods that learn a single concept embedding
from the given images, our DETEX represents each image using multiple word
embeddings during training, i.e., a learnable image-shared subject embedding
and several image-specific subject-unrelated embeddings. To decouple irrelevant
attributes (i.e., background and pose) from the subject embedding, we further
present several attribute mappers that encode each image as several
image-specific subject-unrelated embeddings. To encourage these unrelated
embeddings to capture the irrelevant information, we incorporate them with
corresponding attribute words and propose a joint training strategy to
facilitate the disentanglement. During inference, we use only the subject
embedding for image generation, while the image-specific embeddings can be
selectively included to retain image-specified attributes. Extensive experiments
demonstrate that the subject embedding obtained by our method can faithfully
represent the target concept, while showing superior editability compared to
the state-of-the-art methods. Our code will be made publicly available.
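The abstract gives only a high-level description of the method, so the following is a minimal, hypothetical PyTorch sketch of the decoupled-embedding idea: one learnable, image-shared subject embedding plus image-specific subject-unrelated embeddings produced by small attribute mappers. The module names, feature dimensions, and two-layer mapper design are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class AttributeMapper(nn.Module):
    """Maps an image feature to one subject-unrelated word embedding
    (e.g., background or pose). The two-layer MLP is an assumption."""

    def __init__(self, feat_dim: int = 768, token_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        return self.net(image_feat)


class DecoupledTextualEmbeddings(nn.Module):
    """Image-shared subject embedding plus image-specific attribute embeddings."""

    def __init__(self, token_dim: int = 768, attributes=("background", "pose")):
        super().__init__()
        # One learnable word embedding shared across all training images.
        self.subject_embedding = nn.Parameter(torch.randn(token_dim) * 0.02)
        # One mapper per subject-unrelated attribute.
        self.mappers = nn.ModuleDict(
            {name: AttributeMapper(token_dim, token_dim) for name in attributes}
        )

    def forward(self, image_feat: torch.Tensor, use_attributes: bool = True):
        batch = image_feat.size(0)
        tokens = {"subject": self.subject_embedding.expand(batch, -1)}
        if use_attributes:  # training: attribute embeddings paired with attribute words
            for name, mapper in self.mappers.items():
                tokens[name] = mapper(image_feat)
        return tokens  # injected at placeholder token positions in the prompt


# Training conditions the diffusion model on subject + attribute embeddings;
# inference keeps only the subject embedding (attributes may be selectively
# reused to retain a specific background or pose).
model = DecoupledTextualEmbeddings()
feats = torch.randn(4, 768)                         # assumed CLIP-like image features
train_tokens = model(feats)                         # subject, background, pose
infer_tokens = model(feats, use_attributes=False)   # subject only
```

In this sketch, the joint training the abstract describes would pair each attribute embedding with its attribute word in the prompt, pushing the shared subject embedding to capture only subject-related information.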
Related papers
- Attention Calibration for Disentangled Text-to-Image Personalization [12.339742346826403]
We propose an attention calibration mechanism to improve the concept-level understanding of the T2I model.
We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2024-03-27T13:31:39Z)
- Tuning-Free Image Customization with Image and Text Guidance [65.9504243633169]
We introduce a tuning-free framework for simultaneous text-image-guided image customization.
Our approach preserves the semantic features of the reference image subject while allowing modification of detailed attributes based on text descriptions.
Our approach outperforms previous methods in both human and quantitative evaluations.
arXiv Detail & Related papers (2024-03-19T11:48:35Z)
- Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition [47.07564907486087]
Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts.
This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models.
arXiv Detail & Related papers (2024-02-23T18:55:09Z)
- Visual Concept-driven Image Generation with Text-to-Image Diffusion Model [65.96212844602866]
Text-to-image (TTI) models have demonstrated impressive results in generating high-resolution images of complex scenes.
Recent approaches have extended these methods with personalization techniques that allow them to integrate user-illustrated concepts.
However, the ability to generate images with multiple interacting concepts, such as human subjects, as well as concepts that may be entangled in one or across multiple image illustrations, remains elusive.
We propose a concept-driven TTI personalization framework that addresses these core challenges.
arXiv Detail & Related papers (2024-02-18T07:28:37Z)
- Textual Localization: Decomposing Multi-concept Images for Subject-Driven Text-to-Image Generation [5.107886283951882]
We introduce a localized text-to-image model to handle multi-concept input images.
Our method incorporates a novel cross-attention guidance to decompose multiple concepts.
Notably, our method generates cross-attention maps consistent with the target concept in the generated images.
arXiv Detail & Related papers (2024-02-15T14:19:42Z)
- Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods.
The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z)
- Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models [80.75258849913574]
In this paper, we consider the inverse problem -- given a collection of different images, can we discover the generative concepts that represent each image?
We present an unsupervised approach to discover generative concepts from a collection of images, disentangling different art styles in paintings, objects, and lighting from kitchen scenes, and discovering image classes given ImageNet images.
arXiv Detail & Related papers (2023-06-08T17:02:15Z)
- Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z)
- Learning Multimodal Affinities for Textual Editing in Images [18.7418059568887]
We devise a generic unsupervised technique to learn multimodal affinities between textual entities in a document-image.
We then use these learned affinities to automatically cluster the textual entities in the image into different semantic groups.
We show that our technique can operate on highly varying images spanning a wide range of documents and demonstrate its applicability for various editing operations.
arXiv Detail & Related papers (2021-03-18T10:09:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.