Gen4Gen: Generative Data Pipeline for Generative Multi-Concept
Composition
- URL: http://arxiv.org/abs/2402.15504v1
- Date: Fri, 23 Feb 2024 18:55:09 GMT
- Title: Gen4Gen: Generative Data Pipeline for Generative Multi-Concept
Composition
- Authors: Chun-Hsiao Yeh, Ta-Ying Cheng, He-Yen Hsieh, Chuan-En Lin, Yi Ma,
Andrew Markham, Niki Trigoni, H.T. Kung, Yubei Chen
- Abstract summary: Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts.
This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models.
- Score: 47.07564907486087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent text-to-image diffusion models are able to learn and synthesize images
containing novel, personalized concepts (e.g., their own pets or specific
items) with just a few examples for training. This paper tackles two
interconnected issues within this realm of personalizing text-to-image
diffusion models. First, current personalization techniques fail to reliably
extend to multiple concepts -- we hypothesize this to be due to the mismatch
between complex scenes and simple text descriptions in the pre-training dataset
(e.g., LAION). Second, given an image containing multiple personalized
concepts, there is no holistic metric that evaluates performance not only on
the degree of resemblance of personalized concepts, but also on whether all
concepts are present in the image and whether the image accurately reflects the
overall text description. To address these issues, we introduce Gen4Gen, a
semi-automated dataset creation pipeline utilizing generative models to combine
personalized concepts into complex compositions along with text descriptions.
Using this pipeline, we create a dataset called MyCanvas, which can be used to benchmark
the task of multi-concept personalization. In addition, we design a
comprehensive metric comprising two scores (CP-CLIP and TI-CLIP) for better
quantifying the performance of multi-concept, personalized text-to-image
diffusion methods. We provide a simple baseline built on top of Custom
Diffusion with empirical prompting strategies for future researchers to
evaluate on MyCanvas. We show that by improving data quality and prompting
strategies, we can significantly increase multi-concept personalized image
generation quality, without requiring any modifications to model architecture
or training algorithms.
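The abstract does not spell out how the CP-CLIP and TI-CLIP scores are computed, so the following is only an illustrative sketch of the kind of CLIP-based similarity scoring such metrics typically build on; the library (Hugging Face transformers), model checkpoint, and function names are assumptions rather than the authors' implementation.

```python
# Hedged sketch: generic CLIP similarity helpers, NOT the official CP-CLIP/TI-CLIP code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def text_image_similarity(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP embeddings of a generated image and its text prompt
    (the kind of alignment a TI-CLIP-style score would measure)."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

@torch.no_grad()
def image_image_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between CLIP embeddings of two images, e.g. a generated concept
    region versus a reference photo (the resemblance aspect a CP-CLIP-style score targets)."""
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return float((emb[0] * emb[1]).sum())
```

In practice, a composition-aware score along the lines described in the abstract would also need to verify that every personalized concept is present (e.g., by scoring each concept region separately), which is beyond this minimal sketch.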
Related papers
- AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization [4.544788024283586]
AttenCraft is an attention-guided method for multiple concept disentanglement.
We introduce Uniform sampling and Reweighted sampling schemes to alleviate the non-synchronicity of feature acquisition from different concepts.
Our method outperforms baseline models in terms of image alignment and performs comparably on text alignment.
arXiv Detail & Related papers (2024-05-28T08:50:14Z) - FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition [49.2208591663092]
FreeCustom is a tuning-free method to generate customized images of multi-concept composition based on reference concepts.
We introduce a new multi-reference self-attention (MRSA) mechanism and a weighted mask strategy.
Our method outperforms or performs on par with other training-based methods in terms of multi-concept composition and single-concept customization.
arXiv Detail & Related papers (2024-05-22T17:53:38Z) - Textual Localization: Decomposing Multi-concept Images for
Subject-Driven Text-to-Image Generation [5.107886283951882]
We introduce a localized text-to-image model to handle multi-concept input images.
Our method incorporates a novel cross-attention guidance to decompose multiple concepts.
Notably, our method generates cross-attention maps consistent with the target concept in the generated images.
arXiv Detail & Related papers (2024-02-15T14:19:42Z) - Decoupled Textual Embeddings for Customized Image Generation [62.98933630971543]
Customized text-to-image generation aims to learn user-specified concepts with a few images.
Existing methods usually suffer from overfitting and entangle subject-unrelated information with the learned concept.
We propose DETEX, a novel approach that learns disentangled concept embeddings for flexible customized text-to-image generation.
arXiv Detail & Related papers (2023-12-19T03:32:10Z) - Cross-Modal Concept Learning and Inference for Vision-Language Models [31.463771883036607]
In existing fine-tuning methods, the class-specific text description is matched against the whole image.
We develop a new method called cross-modal concept learning and inference (CCLI).
Our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts.
arXiv Detail & Related papers (2023-07-28T10:26:28Z) - Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z) - Taming Encoder for Zero Fine-tuning Image Customization with
Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z) - Designing an Encoder for Fast Personalization of Text-to-Image Models [57.62449900121022]
We propose an encoder-based domain-tuning approach for text-to-image personalization.
We employ two components: first, an encoder that takes as input a single image of a target concept from a given domain.
Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts.
arXiv Detail & Related papers (2023-02-23T18:46:41Z)