A Neural Space-Time Representation for Text-to-Image Personalization
- URL: http://arxiv.org/abs/2305.15391v1
- Date: Wed, 24 May 2023 17:53:07 GMT
- Title: A Neural Space-Time Representation for Text-to-Image Personalization
- Authors: Yuval Alaluf, Elad Richardson, Gal Metzer, Daniel Cohen-Or
- Abstract summary: A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process.
In this paper, we explore a new text-conditioning space that is dependent on both the denoising process timestep (time) and the denoising U-Net layers (space).
A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, making this space challenging to optimize directly.
- Score: 46.772764467280986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A key aspect of text-to-image personalization methods is the manner in which
the target concept is represented within the generative process. This choice
greatly affects the visual fidelity, downstream editability, and disk space
needed to store the learned concept. In this paper, we explore a new
text-conditioning space that is dependent on both the denoising process
timestep (time) and the denoising U-Net layers (space) and showcase its
compelling properties. A single concept in the space-time representation is
composed of hundreds of vectors, one for each combination of time and space,
making this space challenging to optimize directly. Instead, we propose to
implicitly represent a concept in this space by optimizing a small neural
mapper that receives the current time and space parameters and outputs the
matching token embedding. In doing so, the entire personalized concept is
represented by the parameters of the learned mapper, resulting in a compact,
yet expressive, representation. Similarly to other personalization methods, the
output of our neural mapper resides in the input space of the text encoder. We
observe that one can significantly improve the convergence and visual fidelity
of the concept by introducing a textual bypass, where our neural mapper
additionally outputs a residual that is added to the output of the text
encoder. Finally, we show how one can impose an importance-based ordering over
our implicit representation, providing users control over the reconstruction
and editability of the learned concept using a single trained model. We
demonstrate the effectiveness of our approach over a range of concepts and
prompts, showing our method's ability to generate high-quality and controllable
compositions without fine-tuning any parameters of the generative model itself.
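
To make the described architecture concrete, the following is a minimal sketch, in PyTorch, of how such a space-time mapper could look. The layer sizes, the 768-dimensional embedding width (matching Stable Diffusion's CLIP text encoder), and the raw scalar conditioning are illustrative assumptions rather than the paper's exact implementation; only the overall interface, a small network mapping a (timestep, layer) pair to a token embedding plus a bypass residual, follows the abstract.

```python
# Minimal sketch (assumed, not the authors' released code) of a space-time
# neural mapper: an MLP that maps a (denoising timestep, U-Net layer index)
# pair to a token embedding plus a "textual bypass" residual.
import torch
import torch.nn as nn

class NeuralMapper(nn.Module):
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        # For brevity the raw (t, l) pair is fed directly; a positional
        # encoding of the two scalars would be a natural refinement.
        self.net = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.LeakyReLU(),
        )
        # Two heads: a token embedding placed in the text encoder's input
        # space, and a residual added to the text encoder's output.
        self.to_token = nn.Linear(hidden_dim, embed_dim)
        self.to_bypass = nn.Linear(hidden_dim, embed_dim)

    def forward(self, timestep: torch.Tensor, layer_idx: torch.Tensor):
        x = torch.stack([timestep.float(), layer_idx.float()], dim=-1)
        h = self.net(x)
        return self.to_token(h), self.to_bypass(h)

# Query the mapper for denoising step t=500 at U-Net cross-attention layer 3.
mapper = NeuralMapper()
token_emb, bypass_residual = mapper(torch.tensor([500.0]), torch.tensor([3.0]))
print(token_emb.shape, bypass_residual.shape)  # both torch.Size([1, 768])
```

In such a setup, only the mapper's parameters would be optimized during personalization, while the U-Net and text encoder stay frozen, consistent with the abstract's statement that no parameters of the generative model itself are fine-tuned.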
Related papers
- Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace [52.24866347353916]
We propose an efficient method to explore the target embedding in a textual subspace.
We also propose an efficient selection strategy for determining the basis of the textual subspace.
Our method opens the door to more efficient representation learning for personalized text-to-image generation.
arXiv Detail & Related papers (2024-06-30T06:41:21Z)
- CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization [56.892032386104006]
CatVersion is an inversion-based method that learns the personalized concept through a handful of examples.
Users can utilize text prompts to generate images that embody the personalized concept.
arXiv Detail & Related papers (2023-11-24T17:55:10Z)
- Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models [59.094601993993535]
Text-to-image (T2I) personalization allows users to combine their own visual concepts in natural language prompts.
Most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts.
We propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts.
arXiv Detail & Related papers (2023-07-13T17:46:42Z)
- ConES: Concept Embedding Search for Parameter Efficient Tuning Large Vision Language Models [21.15548013842187]
We propose a Concept Embedding Search (ConES) approach by optimizing prompt embeddings.
By dropping the text encoder, we are able to significantly speed up the learning process.
Our approach can beat the prompt tuning and textual inversion methods in a variety of downstream tasks.
arXiv Detail & Related papers (2023-05-30T12:45:49Z)
- ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation [59.44301617306483]
We propose a learning-based encoder for fast and accurate customized text-to-image generation.
Our method enables high-fidelity inversion and more robust editability with a significantly faster encoding process.
arXiv Detail & Related papers (2023-02-27T14:49:53Z)
- Designing an Encoder for Fast Personalization of Text-to-Image Models [57.62449900121022]
We propose an encoder-based domain-tuning approach for text-to-image personalization.
We employ two components: First, an encoder that takes as an input a single image of a target concept from a given domain.
Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts.
arXiv Detail & Related papers (2023-02-23T18:46:41Z)