Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization
- URL: http://arxiv.org/abs/2403.14155v1
- Date: Thu, 21 Mar 2024 06:03:51 GMT
- Title: Harmonizing Visual and Textual Embeddings for Zero-Shot Text-to-Image Customization
- Authors: Yeji Song, Jimyeong Kim, Wonhark Park, Wonsik Shin, Wonjong Rhee, Nojun Kwak
- Abstract summary: Amid a surge of text-to-image (T2I) models, customization methods generate new images of a user-provided subject.
These zero-shot customization methods encode the image of a specified subject into a visual embedding which is then utilized alongside the textual embedding for diffusion guidance.
We propose an orthogonal visual embedding that effectively harmonizes with the given textual embedding.
We also adopt a visual-only embedding and inject the subject's clear features using a self-attention swap.
- Score: 23.04290567321589
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In a surge of text-to-image (T2I) models and their customization methods that generate new images of a user-provided subject, current works focus on alleviating the costs incurred by a lengthy per-subject optimization. These zero-shot customization methods encode the image of a specified subject into a visual embedding which is then utilized alongside the textual embedding for diffusion guidance. The visual embedding incorporates intrinsic information about the subject, while the textual embedding provides a new, transient context. However, the existing methods often 1) are significantly affected by the input images, e.g., generating images with the same pose, and 2) exhibit deterioration in the subject's identity. We first pin down the problem and show that redundant pose information in the visual embedding interferes with the textual embedding containing the desired pose information. To address this issue, we propose orthogonal visual embedding which effectively harmonizes with the given textual embedding. We also adopt the visual-only embedding and inject the subject's clear features utilizing a self-attention swap. Our results demonstrate the effectiveness and robustness of our method, which offers highly flexible zero-shot generation while effectively maintaining the subject's identity.
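The abstract describes two mechanisms: an orthogonal visual embedding and a self-attention swap. The paper's code is not reproduced here, but the orthogonal embedding can be read as a Gram-Schmidt-style projection that removes from the visual embedding the component lying along the textual embedding, so the transient context carried by the text (e.g., pose) is not overridden; the self-attention swap can be read as reusing key/value features from a visual-only pass to reinforce the subject's identity. The sketch below is an illustration under assumed tensor shapes, with hypothetical function names (orthogonalize, self_attention_swap); it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def orthogonalize(visual_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Project the visual embedding onto the orthogonal complement of the
    textual embedding (Gram-Schmidt-style). Shapes assumed to be (batch, dim)."""
    text_dir = F.normalize(text_emb, dim=-1)                 # unit direction of the text embedding
    overlap = (visual_emb * text_dir).sum(-1, keepdim=True)  # scalar projection onto that direction
    return visual_emb - overlap * text_dir                   # drop the overlapping (e.g., pose) component

def self_attention_swap(q_gen: torch.Tensor, k_ref: torch.Tensor, v_ref: torch.Tensor) -> torch.Tensor:
    """Attend with the generation branch's queries over keys/values taken from a
    visual-only reference pass, injecting the subject's features.
    Shapes assumed to be (batch, heads, tokens, head_dim)."""
    scale = q_gen.shape[-1] ** -0.5
    attn = torch.softmax(q_gen @ k_ref.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_ref
```

In a diffusion U-Net, such a projection would be applied to the subject embedding before it conditions the denoiser, and the key/value swap only in selected self-attention layers and timesteps; the exact placement is specified by the paper itself and is not assumed here.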
Related papers
- DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation [22.599542105037443]
DisEnvisioner is a novel approach for effectively extracting and enriching subject-essential features while filtering out subject-irrelevant information.
Specifically, the features of the subject and other irrelevant components are effectively separated into distinct visual tokens, enabling much more accurate customization.
Experiments demonstrate the superiority of our approach over existing methods in instruction response (editability), ID consistency, inference speed, and the overall image quality.
arXiv Detail & Related papers (2024-10-02T22:29:14Z) - Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace [52.24866347353916]
We propose an efficient method to explore the target embedding in a textual subspace.
We also propose an efficient selection strategy for determining the basis of the textual subspace.
Our method opens the door to more efficient representation learning for personalized text-to-image generation.
arXiv Detail & Related papers (2024-06-30T06:41:21Z) - MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [6.4680449907623006]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multiple subjects.
The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving the control of texts.
arXiv Detail & Related papers (2024-06-11T12:32:53Z) - Text Guided Image Editing with Automatic Concept Locating and Forgetting [27.70615803908037]
We propose a novel method called Locate and Forget (LaF) to locate potential target concepts in the image for modification.
Compared to the baselines, our method demonstrates its superiority in text-guided image editing tasks both qualitatively and quantitatively.
arXiv Detail & Related papers (2024-05-30T05:36:32Z) - Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process.
We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
arXiv Detail & Related papers (2024-03-25T17:52:07Z) - Tuning-Free Image Customization with Image and Text Guidance [65.9504243633169]
We introduce a tuning-free framework for simultaneous text-image-guided image customization.
Our approach preserves the semantic features of the reference image subject while allowing modification of detailed attributes based on text descriptions.
Our approach outperforms previous methods in both human and quantitative evaluations.
arXiv Detail & Related papers (2024-03-19T11:48:35Z) - Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models struggle to consistently portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z) - Decoupled Textual Embeddings for Customized Image Generation [62.98933630971543]
Customized text-to-image generation aims to learn user-specified concepts with a few images.
Existing methods usually suffer from overfitting issues and entangle the subject-unrelated information with the learned concept.
We propose DETEX, a novel approach that learns a disentangled concept embedding for flexible customized text-to-image generation.
arXiv Detail & Related papers (2023-12-19T03:32:10Z) - DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation [50.39533637201273]
We propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation.
By combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability.
arXiv Detail & Related papers (2023-05-05T09:08:25Z)