Related papers: Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization

URL: http://arxiv.org/abs/2403.15330v1
Date: Fri, 22 Mar 2024 16:35:38 GMT
Title: Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization
Authors: Jimyeong Kim, Jungwon Park, Wonjong Rhee
Abstract summary: We propose SID (Selectively Informative Description), a text description strategy that deviates from the prevalent approach of only characterizing the subject's class identification. We present comprehensive experimental results along with analyses of cross-attention maps, subject-alignment, non-subject-disentanglement, and text-alignment.
Score: 5.141049647900161
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In text-to-image personalization, a timely and crucial challenge is the tendency of generated images overfitting to the biases present in the reference images. We initiate our study with a comprehensive categorization of the biases into background, nearby-object, tied-object, substance (in style re-contextualization), and pose biases. These biases manifest in the generated images due to their entanglement into the subject embedding. This undesired embedding entanglement not only results in the reflection of biases from the reference images into the generated images but also notably diminishes the alignment of the generated images with the given generation prompt. To address this challenge, we propose SID (Selectively Informative Description), a text description strategy that deviates from the prevalent approach of only characterizing the subject's class identification. SID is generated utilizing multimodal GPT-4 and can be seamlessly integrated into optimization-based models. We present comprehensive experimental results along with analyses of cross-attention maps, subject-alignment, non-subject-disentanglement, and text-alignment.

Related papers

Recovering Partially Corrupted Major Objects through Tri-modality Based Image Completion [13.846868357952419]
Diffusion models have become widely adopted in image completion tasks. A persistent challenge arises when an object is partially obscured in the damaged region, yet its remaining parts are still visible in the background. We propose supplementing text-based guidance with a novel visual aid: a casual sketch. This sketch supplies critical structural cues, enabling the generative model to produce an object structure that seamlessly integrates with the existing background.
arXiv Detail & Related papers (2025-03-10T08:34:31Z)
Energy-Guided Optimization for Personalized Image Editing with Pretrained Text-to-Image Diffusion Models [20.582222123619285]
We propose a training-free framework that formulates personalized content editing as the optimization of edited images in the latent space. A coarse-to-fine strategy is proposed that employs text energy guidance at the early stage to achieve a natural transition toward the target class. Our method excels in object replacement even with a large domain gap.
arXiv Detail & Related papers (2025-03-06T08:52:29Z)
DECOR: Decomposition and Projection of Text Embeddings for Text-to-Image Customization [15.920735314050296]
This study decomposes the text embedding matrix and conducts a component analysis to understand the embedding space geometry. We propose DECOR, which projects text embeddings onto a vector space orthogonal to undesired token vectors. Experimental results demonstrate that DECOR outperforms state-of-the-art customization models.
arXiv Detail & Related papers (2024-12-12T10:59:44Z)
Debiasing Vison-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [6.4680449907623006]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multi-subjects. The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving the control of texts.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation [22.949365270116335]
We propose a subject-driven generation framework and introduce training-free guidance to intervene in the generative process during inference time. Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation.
arXiv Detail & Related papers (2024-05-11T08:11:25Z)
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation [60.943159830780154]
We introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. We demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
arXiv Detail & Related papers (2024-03-25T17:52:07Z)
Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods. The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z)
Decoupled Textual Embeddings for Customized Image Generation [62.98933630971543]
Customized text-to-image generation aims to learn user-specified concepts with a few images. Existing methods usually suffer from overfitting issues and entangle the subject-unrelated information with the learned concept. We propose the DETEX, a novel approach that learns the disentangled concept embedding for flexible customized text-to-image generation.
arXiv Detail & Related papers (2023-12-19T03:32:10Z)
Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods [52.806258774051216]
We focus on text-to-image systems that input a single image of an individual and ground the generation process along with text describing the desired visual context. We introduce a standardized dataset (Stellar) that contains personalized prompts coupled with images of individuals that is an order of magnitude larger than existing relevant datasets and where rich semantic ground-truth annotations are readily available. We derive a simple yet efficient, personalized text-to-image baseline that does not require test-time fine-tuning for each subject and which sets quantitatively and in human trials a new SoTA.
arXiv Detail & Related papers (2023-12-11T04:47:39Z)
Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions. We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD). In our experiments, we apply PhD to subject-driven image editing tasks and explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z)
DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation [50.39533637201273]
We propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. By combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability.
arXiv Detail & Related papers (2023-05-05T09:08:25Z)
Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression. In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net). Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.