Personalize Anything for Free with Diffusion Transformer
- URL: http://arxiv.org/abs/2503.12590v1
- Date: Sun, 16 Mar 2025 17:51:16 GMT
- Title: Personalize Anything for Free with Diffusion Transformer
- Authors: Haoran Feng, Zehuan Huang, Lin Li, Hairong Lv, Lu Sheng,
- Abstract summary: Recent training-free approaches struggle with identity preservation, applicability, and compatibility with diffusion transformers (DiTs). We uncover the untapped potential of DiT, where simply replacing denoising tokens with those of a reference subject achieves zero-shot subject reconstruction. We propose Personalize Anything, a training-free framework that achieves personalized image generation in DiT through: 1) timestep-adaptive token replacement that enforces subject consistency via early-stage injection and enhances flexibility through late-stage regularization, and 2) patch perturbation strategies to boost structural diversity.
- Score: 20.385520869825413
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personalized image generation aims to produce images of user-specified concepts while enabling flexible editing. Recent training-free approaches, while exhibiting higher computational efficiency than training-based methods, struggle with identity preservation, applicability, and compatibility with diffusion transformers (DiTs). In this paper, we uncover the untapped potential of DiT, where simply replacing denoising tokens with those of a reference subject achieves zero-shot subject reconstruction. This simple yet effective feature injection technique unlocks diverse scenarios, from personalization to image editing. Building upon this observation, we propose \textbf{Personalize Anything}, a training-free framework that achieves personalized image generation in DiT through: 1) timestep-adaptive token replacement that enforces subject consistency via early-stage injection and enhances flexibility through late-stage regularization, and 2) patch perturbation strategies to boost structural diversity. Our method seamlessly supports layout-guided generation, multi-subject personalization, and mask-controlled editing. Evaluations demonstrate state-of-the-art performance in identity preservation and versatility. Our work establishes new insights into DiTs while delivering a practical paradigm for efficient personalization.
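The two mechanisms in the abstract can be sketched in a few lines. This is a hypothetical simplification for illustration only, not the paper's implementation: `personalize_step` stands in for the per-step token update inside a DiT denoising loop (the threshold `tau`, blend weight `lam`, and the soft-blend form of the late-stage regularization are assumptions), and `perturb_patches` illustrates patch perturbation as random neighbor swaps on the reference token grid.

```python
import numpy as np

def personalize_step(x_tokens, ref_tokens, mask, t, T, tau=0.6, lam=0.1):
    """One denoising-step update sketching timestep-adaptive token replacement.

    Denoising runs from t = T down to 0, so large t is the early stage.
    Early (t/T > tau): hard-replace the masked subject tokens with the
    reference subject's tokens to lock in identity.
    Late: softly pull the masked tokens toward the reference instead,
    leaving room for the prompt to adjust pose, lighting, etc.
    """
    x = x_tokens.copy()
    if t / T > tau:                                      # early, high-noise stage
        x[mask] = ref_tokens[mask]                       # hard injection
    else:                                                # late, low-noise stage
        x[mask] = (1 - lam) * x[mask] + lam * ref_tokens[mask]
    return x

def perturb_patches(ref_tokens, grid_hw, rng, swap_frac=0.1):
    """Patch perturbation sketch: swap a small fraction of adjacent token
    pairs on the reference grid to encourage structural diversity."""
    h, w = grid_hw
    tokens = ref_tokens.copy()
    for _ in range(int(swap_frac * h * w)):
        i = rng.integers(0, h * w - 1)                   # pick a token and its neighbor
        tokens[[i, i + 1]] = tokens[[i + 1, i]]          # swap the pair
    return tokens
```

Under this reading, identity preservation comes from the early hard injection (the subject tokens are literally the reference's), while flexibility comes from relaxing to a soft blend once the global layout has formed.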
Related papers
- Flux Already Knows -- Activating Subject-Driven Image Generation without Training [25.496237241889048]
We propose a zero-shot framework for subject-driven image generation using a vanilla Flux model.
We activate strong identity-preserving capabilities without any additional data, training, or inference-time fine-tuning.
arXiv Detail & Related papers (2025-04-12T20:41:53Z) - Energy-Guided Optimization for Personalized Image Editing with Pretrained Text-to-Image Diffusion Models [20.582222123619285]
We propose a training-free framework that formulates personalized content editing as the optimization of edited images in the latent space. A coarse-to-fine strategy is proposed that employs text energy guidance at the early stage to achieve a natural transition toward the target class. Our method excels in object replacement even with a large domain gap.
arXiv Detail & Related papers (2025-03-06T08:52:29Z) - PersonaMagic: Stage-Regulated High-Fidelity Face Customization with Tandem Equilibrium [55.72249032433108]
PersonaMagic is a stage-regulated generative technique designed for high-fidelity face customization. Our method learns a series of embeddings within a specific timestep interval to capture face concepts. Tests confirm the superiority of PersonaMagic over state-of-the-art methods in both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2024-12-20T08:41:25Z) - Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning [40.06403155373455]
We propose a novel reinforcement learning framework for personalized text-to-image generation.
Our proposed approach outperforms existing state-of-the-art methods by a large margin on visual fidelity while maintaining text-alignment.
arXiv Detail & Related papers (2024-07-09T08:11:53Z) - Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models can portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z) - PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization [92.90392834835751]
PortraitBooth is designed for high efficiency, robust identity preservation, and expression-editable text-to-image generation.
PortraitBooth eliminates computational overhead and mitigates identity distortion.
It incorporates emotion-aware cross-attention control for diverse facial expressions in generated images.
arXiv Detail & Related papers (2023-12-11T13:03:29Z) - PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models [19.519789922033034]
PhotoVerse is an innovative methodology that incorporates a dual-branch conditioning mechanism in both text and image domains.
After a single training phase, our approach enables generating high-quality images within only a few seconds.
arXiv Detail & Related papers (2023-09-11T19:59:43Z) - Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models [59.094601993993535]
Text-to-image (T2I) personalization allows users to combine their own visual concepts in natural language prompts.
Most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts.
We propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts.
arXiv Detail & Related papers (2023-07-13T17:46:42Z) - Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z) - Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion [34.662798793560995]
We present a simple yet highly effective approach to personalization using highly personalized (HiPer) text embedding.
Our method does not require model fine-tuning or identifiers, yet still enables manipulation of background, texture, and motion with just a single image and target text.
arXiv Detail & Related papers (2023-03-15T17:07:45Z) - Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text [93.11954811297652]
We design a unified transformer consisting of modality-specific tokenizers, a shared transformer encoder, and task-specific output heads.
We employ the separately-trained BERT and ViT models as teachers and apply knowledge distillation to provide additional, accurate supervision signals.
Experiments show that the resultant unified foundation transformer works surprisingly well on both the vision-only and text-only tasks.
arXiv Detail & Related papers (2021-12-14T00:20:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.