Key-Locked Rank One Editing for Text-to-Image Personalization
- URL: http://arxiv.org/abs/2305.01644v2
- Date: Wed, 5 Jun 2024 10:24:43 GMT
- Title: Key-Locked Rank One Editing for Text-to-Image Personalization
- Authors: Yoad Tewel, Rinon Gal, Gal Chechik, Yuval Atzmon
- Abstract summary: We present Perfusion, a T2I personalization method that addresses challenges using dynamic rank-1 updates to the underlying T2I model.
Perfusion avoids overfitting by introducing a new mechanism that "locks" new concepts' cross-attention Keys to their superordinate category.
We show that Perfusion outperforms strong baselines in both qualitative and quantitative terms.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image (T2I) models offer a new level of flexibility by allowing users to guide the creative process through natural language. However, personalizing these models to align with user-provided visual concepts remains a challenging problem. The task of T2I personalization poses multiple hard challenges, such as maintaining high visual fidelity while allowing creative control, combining multiple personalized concepts in a single image, and keeping a small model size. We present Perfusion, a T2I personalization method that addresses these challenges using dynamic rank-1 updates to the underlying T2I model. Perfusion avoids overfitting by introducing a new mechanism that "locks" new concepts' cross-attention Keys to their superordinate category. Additionally, we develop a gated rank-1 approach that enables us to control the influence of a learned concept during inference time and to combine multiple concepts. This allows runtime-efficient balancing of visual fidelity and textual alignment with a single 100KB trained model, which is five orders of magnitude smaller than the current state of the art. Moreover, it can span different operating points across the Pareto front without additional training. Finally, we show that Perfusion outperforms strong baselines in both qualitative and quantitative terms. Importantly, key-locking leads to novel results compared to traditional approaches, allowing personalized object interactions to be portrayed in unprecedented ways, even in one-shot settings.
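The core idea the abstract describes, a rank-1 edit that "locks" a new concept's cross-attention Key to its superordinate category, can be sketched in NumPy. This is a minimal illustration under simplifying assumptions, not Perfusion's actual implementation: the function and variable names (`rank1_edit`, `e_concept`, `e_super`, `v_learned`) are hypothetical, the identity-covariance simplification is ours, and the gating mechanism is omitted.

```python
import numpy as np

def rank1_edit(W, e, target):
    """Rank-1 update of a linear map W so that the edited map sends e to
    target, while leaving directions orthogonal to e unchanged.
    (Identity-covariance simplification of a ROME-style closed-form edit.)"""
    residual = target - W @ e
    return W + np.outer(residual, e) / (e @ e)

def key_locked_edit(W_K, W_V, e_concept, e_super, v_learned):
    """Hypothetical sketch of key-locking: the new concept's cross-attention
    Key is locked to that of its superordinate category (e.g. a personalized
    teddy bear maps to the Key of "teddy"), while its Value is steered
    toward a learned per-concept vector."""
    W_K_new = rank1_edit(W_K, e_concept, W_K @ e_super)  # lock Key to category
    W_V_new = rank1_edit(W_V, e_concept, v_learned)      # learn the Value
    return W_K_new, W_V_new
```

Since each edit stores only a rank-1 factor pair per attention layer, the on-disk footprint stays tiny, consistent with the ~100KB model size the abstract reports; combining multiple concepts then amounts to summing (gated) rank-1 terms.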
Related papers
- MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models [51.1034358143232]
We introduce component-controllable personalization, a novel task that pushes the boundaries of text-to-image (T2I) models.
To overcome these challenges, we design MagicTailor, an innovative framework that leverages Dynamic Masked Degradation (DM-Deg) to dynamically perturb undesired visual semantics.
arXiv Detail & Related papers (2024-10-17T09:22:53Z)
- AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization [4.544788024283586]
AttenCraft is an attention-guided method for multiple concept disentanglement.
We introduce Uniform sampling and Reweighted sampling schemes to alleviate the non-synchronicity of feature acquisition from different concepts.
Our method outperforms baseline models in terms of image-alignment and performs comparably on text-alignment.
arXiv Detail & Related papers (2024-05-28T08:50:14Z)
- MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation [49.935634230341904]
We introduce Multi-concept guidance for Multi-concept customization, termed MC$^2$, for improved flexibility and fidelity.
MC$2$ decouples the requirements for model architecture via inference time optimization.
It adaptively refines the attention weights between visual and textual tokens, directing image regions to focus on their associated words.
arXiv Detail & Related papers (2024-04-08T07:59:04Z)
- Attention Calibration for Disentangled Text-to-Image Personalization [12.339742346826403]
We propose an attention calibration mechanism to improve the concept-level understanding of the T2I model.
We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2024-03-27T13:31:39Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models [59.094601993993535]
Text-to-image (T2I) personalization allows users to combine their own visual concepts in natural language prompts.
Most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts.
We propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts.
arXiv Detail & Related papers (2023-07-13T17:46:42Z)
- Designing an Encoder for Fast Personalization of Text-to-Image Models [57.62449900121022]
We propose an encoder-based domain-tuning approach for text-to-image personalization.
We employ two components: First, an encoder that takes as an input a single image of a target concept from a given domain.
Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts.
arXiv Detail & Related papers (2023-02-23T18:46:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.