ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
- URL: http://arxiv.org/abs/2302.13848v2
- Date: Fri, 18 Aug 2023 17:12:13 GMT
- Title: ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
- Authors: Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, Wangmeng Zuo
- Abstract summary: We propose a learning-based encoder for fast and accurate customized text-to-image generation.
Our method enables high-fidelity inversion and more robust editability with a significantly faster encoding process.
- Score: 59.44301617306483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In addition to their unprecedented ability in imaginative creation, large text-to-image models are expected to incorporate customized concepts into image generation. Existing works generally learn such concepts in an optimization-based manner, which brings an excessive computation or memory burden. In this paper, we instead propose a learning-based encoder, consisting of a global and a local mapping network, for fast and accurate customized text-to-image generation. Specifically, the global mapping network projects the hierarchical features of a given image into multiple new words in the textual word embedding space, i.e., one primary word for the well-editable concept and auxiliary words to exclude irrelevant disturbances (e.g., background). Meanwhile, a local mapping network injects the encoded patch features into cross-attention layers to provide omitted details, without sacrificing the editability of the primary concept. We compare our method with existing optimization-based approaches on a variety of user-defined concepts, and demonstrate that our method enables high-fidelity inversion and more robust editability with a significantly faster encoding process. Our code is publicly available at https://github.com/csyxwei/ELITE.
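As a rough illustration of the encoder described above, the following PyTorch sketch shows how a global mapping network could turn per-level image features into one primary plus several auxiliary word embeddings, and how a local mapping network could project patch features into the cross-attention context space. All dimensions, level counts, and module names here are assumptions made for illustration, not the paper's configuration; the authors' actual implementation is in the repository linked above.

```python
import torch
import torch.nn as nn


class GlobalMapping(nn.Module):
    """Projects hierarchical image features into textual word embeddings.

    Sketch assumption: one pooled feature vector per encoder level is mapped
    to one word embedding; index 0 acts as the primary (editable) word, the
    rest as auxiliary words that absorb irrelevant content such as background.
    """

    def __init__(self, feat_dim: int = 1024, embed_dim: int = 768, n_levels: int = 5):
        super().__init__()
        self.mappers = nn.ModuleList(
            nn.Sequential(
                nn.Linear(feat_dim, embed_dim),
                nn.GELU(),
                nn.Linear(embed_dim, embed_dim),
            )
            for _ in range(n_levels)
        )

    def forward(self, level_feats: list[torch.Tensor]) -> torch.Tensor:
        # level_feats: n_levels tensors of shape (B, feat_dim)
        words = [mapper(f) for mapper, f in zip(self.mappers, level_feats)]
        return torch.stack(words, dim=1)  # (B, n_levels, embed_dim)


class LocalMapping(nn.Module):
    """Maps patch-level features into the cross-attention context space so
    fine-grained details can be injected alongside the text tokens."""

    def __init__(self, feat_dim: int = 1024, ctx_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, ctx_dim), nn.GELU(), nn.Linear(ctx_dim, ctx_dim)
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, n_patches, feat_dim) -> (B, n_patches, ctx_dim)
        return self.proj(patch_feats)


if __name__ == "__main__":
    B, feat_dim = 2, 1024
    global_net, local_net = GlobalMapping(feat_dim), LocalMapping(feat_dim)
    feats = [torch.randn(B, feat_dim) for _ in range(5)]  # pooled per-level features
    patches = torch.randn(B, 256, feat_dim)               # e.g. 16x16 patch tokens
    print(global_net(feats).shape)   # torch.Size([2, 5, 768])
    print(local_net(patches).shape)  # torch.Size([2, 256, 768])
```

In this sketch the primary word embedding would replace a placeholder token in the prompt, while the projected patch features would be concatenated to the cross-attention context of the diffusion U-Net; how exactly they are injected is left to the official code.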
Related papers
- Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition [47.07564907486087] (arXiv: 2024-02-23T18:55:09Z)
Recent text-to-image diffusion models are able to learn and synthesize images containing novel, personalized concepts.
This paper tackles two interconnected issues within this realm of personalizing text-to-image diffusion models.
- Decoupled Textual Embeddings for Customized Image Generation [62.98933630971543] (arXiv: 2023-12-19T03:32:10Z)
Customized text-to-image generation aims to learn user-specified concepts from a few images.
Existing methods usually suffer from overfitting and entangle subject-unrelated information with the learned concept.
We propose DETEX, a novel approach that learns a disentangled concept embedding for flexible customized text-to-image generation.
- CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization [56.892032386104006] (arXiv: 2023-11-24T17:55:10Z)
CatVersion is an inversion-based method that learns the personalized concept from a handful of examples.
Users can then utilize text prompts to generate images that embody the personalized concept.
- Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models [59.094601993993535] (arXiv: 2023-07-13T17:46:42Z)
Text-to-image (T2I) personalization allows users to combine their own visual concepts in natural language prompts.
Most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts.
We propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts.
- A Neural Space-Time Representation for Text-to-Image Personalization [46.772764467280986] (arXiv: 2023-05-24T17:53:07Z)
A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process.
In this paper, we explore a new text-conditioning space that depends on both the denoising process timestep (time) and the denoising U-Net layers (space).
A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, making this space challenging to optimize directly.
- Designing an Encoder for Fast Personalization of Text-to-Image Models [57.62449900121022] (arXiv: 2023-02-23T18:46:41Z)
We propose an encoder-based domain-tuning approach for text-to-image personalization.
We employ two components: first, an encoder that takes as input a single image of a target concept from a given domain; second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts.
- FlexIT: Towards Flexible Semantic Image Translation [59.09398209706869] (arXiv: 2022-03-09T13:34:38Z)
We propose FlexIT, a novel method which can take any input image and a user-defined text instruction for editing.
First, FlexIT combines the input image and text into a single target point in the CLIP multimodal embedding space.
We then iteratively transform the input image toward the target point, ensuring coherence and quality with a variety of novel regularization terms (an illustrative CLIP-guidance sketch is given at the end of this page).
This list is automatically generated from the titles and abstracts of the papers on this site.
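The FlexIT entry above describes the core mechanism: mix the image and text embeddings into a single CLIP-space target and iteratively push the image toward it. The sketch below illustrates only that generic idea using the openai/CLIP package; the mixing weight, optimizer settings, and the choice to optimize pixels directly are assumptions of this sketch, whereas FlexIT itself operates in a VQGAN latent space with several dedicated regularization terms (see its paper for the actual method).

```python
# Requires: torch, Pillow, and the openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git)
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()
model.requires_grad_(False)  # only the image is optimized


def edit_image(image_path: str, instruction: str, steps: int = 100,
               lr: float = 0.05, lam: float = 0.3) -> torch.Tensor:
    """Push an image toward a mixed image+text target in CLIP space.

    `lam` (text weight), `steps`, and `lr` are arbitrary choices for this
    sketch; FlexIT instead optimizes VQGAN latents with extra regularizers.
    """
    src = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)

    with torch.no_grad():
        img_feat = model.encode_image(src)
        txt_feat = model.encode_text(clip.tokenize([instruction]).to(device))
        target = (1 - lam) * img_feat + lam * txt_feat   # single multimodal target point
        target = target / target.norm(dim=-1, keepdim=True)

    x = src.clone().requires_grad_(True)                 # optimize (preprocessed) pixels directly
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        feat = model.encode_image(x)
        feat = feat / feat.norm(dim=-1, keepdim=True)
        loss = 1.0 - (feat * target).sum(dim=-1).mean()  # cosine distance to the target
        loss = loss + 0.01 * (x - src).pow(2).mean()     # crude stay-close-to-source anchor
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```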