Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace
- URL: http://arxiv.org/abs/2407.00608v1
- Date: Sun, 30 Jun 2024 06:41:21 GMT
- Title: Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace
- Authors: Shian Du, Xiaotian Cheng, Qi Qian, Henglu Wei, Yi Xu, Xiangyang Ji
- Abstract summary: We propose an efficient method to explore the target embedding in a textual subspace.
We also propose an efficient selection strategy for determining the basis of the textual subspace.
Our method opens the door to more efficient representation learning for personalized text-to-image generation.
- Score: 52.24866347353916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Personalized text-to-image generation has attracted unprecedented attention in recent years because of its unique capability to generate highly personalized images from an input concept dataset and a novel textual prompt. However, previous methods focus solely on the performance of the reconstruction task, which degrades their ability to combine the learned concept with different textual prompts. Moreover, optimizing in the high-dimensional embedding space usually leads to an unnecessarily time-consuming training process and slow convergence. To address these issues, we propose an efficient method to explore the target embedding in a textual subspace, drawing inspiration from the self-expressiveness property. Additionally, we propose an efficient selection strategy for determining the basis vectors of the textual subspace. The experimental evaluations demonstrate that the learned embedding not only faithfully reconstructs the input image but also significantly improves its alignment with novel input textual prompts. Furthermore, we observe that optimizing in the textual subspace significantly improves robustness to the initial word, relaxing the constraint that requires users to input the most relevant initial word. Our method opens the door to more efficient representation learning for personalized text-to-image generation.
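The subspace idea in the abstract can be illustrated with a small sketch: instead of training a full d-dimensional embedding, only k combination weights over a fixed set of basis vectors are trained. The code below is a toy illustration and not the authors' implementation. The function names `select_basis` and `fit_weights`, the cosine-similarity selection rule, the random vocabulary, and the squared-error objective (a stand-in for the diffusion training loss) are all assumptions made for this example.

```python
import numpy as np

def select_basis(vocab, init_vec, k):
    """Pick the k vocabulary embeddings most cosine-similar to init_vec.

    Hypothetical selection rule, assumed here for illustration only.
    """
    vocab_n = vocab / np.linalg.norm(vocab, axis=1, keepdims=True)
    init_n = init_vec / np.linalg.norm(init_vec)
    idx = np.argsort(vocab_n @ init_n)[::-1][:k]  # most similar first
    return vocab[idx]  # shape (k, d)

def fit_weights(basis, target, steps=2000):
    """Gradient descent on w for ||w @ basis - target||^2.

    Only k weights are trained; the basis vectors stay fixed.
    The squared error is a stand-in for the real training loss.
    """
    # Step size chosen from the largest singular value so the update is stable.
    lr = 1.0 / (2.0 * np.linalg.norm(basis, 2) ** 2)
    w = np.zeros(basis.shape[0])
    for _ in range(steps):
        residual = w @ basis - target
        w -= lr * 2.0 * (basis @ residual)
    return w

rng = np.random.default_rng(0)
vocab = rng.normal(size=(1000, 64))   # toy stand-in for a token embedding table
init_vec = vocab[0]                   # embedding of the user's initial word
basis = select_basis(vocab, init_vec, k=8)

# Toy target that lies inside the subspace spanned by the basis.
target = rng.normal(size=8) @ basis
w = fit_weights(basis, target)
embedding = w @ basis                 # learned embedding, constrained to the subspace
```

In this toy setup the optimizer searches over 8 numbers rather than 64, which is the kind of dimensionality reduction the abstract credits for faster convergence.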
Related papers
- ARTIST: Improving the Generation of Text-rich Images by Disentanglement [52.23899502520261]
We introduce a new framework named ARTIST to focus on the learning of text structures.
We finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model.
Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
arXiv Detail & Related papers (2024-06-17T19:31:24Z)
- Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization [23.04290567321589]
A surge of text-to-image (T2I) models and their customization methods generate new images of a user-provided subject.
These zero-shot customization methods encode the image of a specified subject into a visual embedding which is then utilized alongside the textual embedding for diffusion guidance.
We propose a visual embedding that harmonizes effectively with the given textual embedding.
We also adopt the visual-only embedding and inject the subject's clear features utilizing a self-attention swap.
arXiv Detail & Related papers (2024-03-21T06:03:51Z)
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models can portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
- Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that input a single image of an individual and ground the generation process along with text describing the desired visual context.
We introduce a standardized dataset (Stellar) that contains personalized prompts coupled with images of individuals that is an order of magnitude larger than existing relevant datasets and where rich semantic ground-truth annotations are readily available.
We derive a simple yet efficient personalized text-to-image baseline that does not require test-time fine-tuning for each subject and that sets a new SoTA both quantitatively and in human trials.
arXiv Detail & Related papers (2023-12-11T04:47:39Z)
- From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models [41.66656119637025]
We propose a method to utilize the attention mechanism in the denoising network of text-to-image diffusion models.
We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the weakly supervised semantic segmentation setting.
Our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.
arXiv Detail & Related papers (2023-09-08T04:10:01Z)
- Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z)
- Unleashing the Imagination of Text: A Novel Framework for Text-to-image Person Retrieval via Exploring the Power of Words [0.951828574518325]
We propose a novel framework to explore the power of words in sentences.
The framework employs the pre-trained full CLIP model as a dual encoder for the images and texts.
We introduce a cross-modal triplet loss tailored for handling hard samples, enhancing the model's ability to distinguish minor differences.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- A Neural Space-Time Representation for Text-to-Image Personalization [46.772764467280986]
A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process.
In this paper, we explore a new text-conditioning space that depends on both the denoising process timestep (time) and the denoising U-Net layers (space).
A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, making this space challenging to optimize directly.
arXiv Detail & Related papers (2023-05-24T17:53:07Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.