CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image
Personalization
- URL: http://arxiv.org/abs/2311.14631v2
- Date: Thu, 30 Nov 2023 14:42:07 GMT
- Title: CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image
Personalization
- Authors: Ruoyu Zhao, Mingrui Zhu, Shiyin Dong, Nannan Wang, Xinbo Gao
- Abstract summary: CatVersion is an inversion-based method that learns the personalized concept through a handful of examples.
Users can utilize text prompts to generate images that embody the personalized concept.
- Score: 56.892032386104006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose CatVersion, an inversion-based method that learns the personalized
concept through a handful of examples. Subsequently, users can utilize text
prompts to generate images that embody the personalized concept, thereby
achieving text-to-image personalization. In contrast to existing approaches
that emphasize word embedding learning or parameter fine-tuning for the
diffusion model, which potentially causes concept dilution or overfitting, our
method concatenates embeddings on the feature-dense space of the text encoder
in the diffusion model to learn the gap between the personalized concept and
its base class, aiming to maximize the preservation of prior knowledge in
diffusion models while restoring the personalized concepts. To this end, we
first dissect the text encoder's integration in the image generation process to
identify the feature-dense space of the encoder. Afterward, we concatenate
embeddings on the Keys and Values in this space to learn the gap between the
personalized concept and its base class. In this way, the concatenated
embeddings ultimately manifest as a residual on the original attention output.
To quantify the results of personalized image generation more accurately and
without bias, we improve the CLIP image alignment score using masks.
Both qualitatively and quantitatively, CatVersion restores personalized
concepts more faithfully and enables more robust editing.
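To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of concatenating a few learnable embeddings onto the Keys and Values of a frozen attention layer, so that the trained tokens shift the original attention output. This is an illustrative sketch under assumed names and shapes (the module `KVConcatPersonalization`, the number of extra tokens, a single-head layout), not the authors' released implementation.

```python
import torch
import torch.nn as nn

class KVConcatPersonalization(nn.Module):
    """Sketch: learnable embeddings concatenated onto the Keys and Values of a
    frozen attention layer. Because softmax attention mixes all Key/Value
    tokens, the extra tokens contribute an additional learned term that shifts
    the output of the frozen layer (the 'residual' described in the abstract)."""

    def __init__(self, dim: int, num_new_tokens: int = 4):
        super().__init__()
        # Only these parameters are trained; the base diffusion model stays frozen.
        self.new_keys = nn.Parameter(0.01 * torch.randn(num_new_tokens, dim))
        self.new_values = nn.Parameter(0.01 * torch.randn(num_new_tokens, dim))

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q: (B, Lq, D); k, v: (B, Lkv, D) produced by the frozen text encoder.
        b = q.shape[0]
        k_cat = torch.cat([k, self.new_keys.expand(b, -1, -1)], dim=1)
        v_cat = torch.cat([v, self.new_values.expand(b, -1, -1)], dim=1)
        attn = torch.softmax(q @ k_cat.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v_cat  # output shifted by the learned Key/Value tokens
```

Only the concatenated Key/Value embeddings would be optimized against the handful of example images, matching the abstract's goal of learning the gap between the personalized concept and its base class while leaving the frozen model's prior untouched. Similarly, the masked variant of the CLIP image alignment score could plausibly be computed by restricting both images to the concept region before encoding. The sketch below assumes Hugging Face transformers and binary PIL masks; the function name and masking strategy are assumptions, not the paper's exact protocol.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def masked_clip_image_score(gen_img: Image.Image, ref_img: Image.Image,
                            gen_mask: Image.Image, ref_mask: Image.Image) -> float:
    """Hypothetical masked CLIP image alignment: zero out background pixels
    with a binary concept mask before encoding, so the score reflects the
    personalized concept rather than the surrounding scene."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def encode(img: Image.Image, mask: Image.Image) -> torch.Tensor:
        # Keep the image where the mask is white, black elsewhere, then embed.
        masked = Image.composite(img, Image.new("RGB", img.size), mask.convert("L"))
        pixel_values = processor(images=masked, return_tensors="pt")["pixel_values"]
        feats = model.get_image_features(pixel_values=pixel_values)
        return feats / feats.norm(dim=-1, keepdim=True)

    with torch.no_grad():
        # Cosine similarity between the two masked image embeddings.
        return (encode(gen_img, gen_mask) @ encode(ref_img, ref_mask).T).item()
```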
Related papers
- Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models [39.46152582128077]
In the real world, a user may wish to personalize a model on multiple concepts, but only one at a time.
Most personalization methods fail to find a balance between acquiring new concepts and retaining previous ones.
We propose regularizing the parameter-space and function-space of text-to-image diffusion models.
arXiv Detail & Related papers (2024-10-01T13:54:29Z)
- Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition [47.07564907486087]
Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts.
This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models.
arXiv Detail & Related papers (2024-02-23T18:55:09Z)
- Decoupled Textual Embeddings for Customized Image Generation [62.98933630971543]
Customized text-to-image generation aims to learn user-specified concepts with a few images.
Existing methods usually suffer from overfitting issues and entangle the subject-unrelated information with the learned concept.
We propose DETEX, a novel approach that learns a disentangled concept embedding for flexible customized text-to-image generation.
arXiv Detail & Related papers (2023-12-19T03:32:10Z)
- Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models [59.094601993993535]
Text-to-image (T2I) personalization allows users to combine their own visual concepts in natural language prompts.
Most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts.
We propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts.
arXiv Detail & Related papers (2023-07-13T17:46:42Z)
- Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z)
- A Neural Space-Time Representation for Text-to-Image Personalization [46.772764467280986]
A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process.
In this paper, we explore a new text-conditioning space that depends on both the denoising process timestep (time) and the denoising U-Net layers (space).
A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, making this space challenging to optimize directly.
arXiv Detail & Related papers (2023-05-24T17:53:07Z)
- ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation [59.44301617306483]
We propose a learning-based encoder for fast and accurate customized text-to-image generation.
Our method enables high-fidelity inversion and more robust editability with a significantly faster encoding process.
arXiv Detail & Related papers (2023-02-27T14:49:53Z)
- Designing an Encoder for Fast Personalization of Text-to-Image Models [57.62449900121022]
We propose an encoder-based domain-tuning approach for text-to-image personalization.
We employ two components: First, an encoder that takes as input a single image of a target concept from a given domain.
Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts.
arXiv Detail & Related papers (2023-02-23T18:46:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.