AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment
- URL: http://arxiv.org/abs/2505.21911v1
- Date: Wed, 28 May 2025 02:57:55 GMT
- Title: AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment
- Authors: Yiheng Lin, Shifang Zhao, Ting Liu, Xiaochao Qu, Luoqi Liu, Yao Zhao, Yunchao Wei
- Abstract summary: We propose AlignGen, a Cross-Modality Prior Alignment mechanism for personalized image generation. We show that AlignGen outperforms existing zero-shot methods and even surpasses popular test-time optimization approaches.
- Score: 74.47138661595584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Personalized image generation aims to integrate user-provided concepts into text-to-image models, enabling the generation of customized content based on a given prompt. Recent zero-shot approaches, particularly those leveraging diffusion transformers, incorporate reference image information through a multi-modal attention mechanism. This integration allows the generated output to be influenced by both the textual prior from the prompt and the visual prior from the reference image. However, we observe that when the prompt and reference image are misaligned, the generated results exhibit a stronger bias toward the textual prior, leading to a significant loss of reference content. To address this issue, we propose AlignGen, a Cross-Modality Prior Alignment mechanism that enhances personalized image generation by: 1) introducing a learnable token to bridge the gap between the textual and visual priors, 2) incorporating a robust training strategy to ensure proper prior alignment, and 3) employing a selective cross-modal attention mask within the multi-modal attention mechanism to further align the priors. Experimental results demonstrate that AlignGen outperforms existing zero-shot methods and even surpasses popular test-time optimization approaches.
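The abstract names three mechanisms (a learnable bridging token, a robust training strategy, and a selective cross-modal attention mask) without detailing them. The PyTorch sketch below illustrates how a learnable alignment token and a selective mask could plausibly be wired into a multi-modal attention block; the token ordering, dimensions, and the specific masking rule (blocking the direct text-to-reference path) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class CrossModalityPriorAlignmentSketch(nn.Module):
    """Hypothetical sketch of the two architectural ideas in the abstract:
    a learnable token prepended to the multi-modal sequence to bridge the
    textual and visual priors, and a selective cross-modal attention mask
    applied inside multi-modal attention. All details are assumptions."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Learnable alignment token (assumed to be a single extra token).
        self.align_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def build_mask(self, n_text: int, n_image: int, n_ref: int) -> torch.Tensor:
        """Boolean mask (True = attention blocked). Assumed rule: prompt tokens
        cannot attend directly to reference-image tokens, so reference content
        reaches the output only through the image latents and the alignment
        token. Sequence layout: [align | text | image latents | reference]."""
        n = 1 + n_text + n_image + n_ref
        mask = torch.zeros(n, n, dtype=torch.bool)
        text = slice(1, 1 + n_text)
        ref = slice(1 + n_text + n_image, n)
        mask[text, ref] = True  # block the direct text -> reference path
        return mask

    def forward(self, text_tok, image_tok, ref_tok):
        b = text_tok.size(0)
        seq = torch.cat(
            [self.align_token.expand(b, -1, -1), text_tok, image_tok, ref_tok], dim=1
        )
        mask = self.build_mask(
            text_tok.size(1), image_tok.size(1), ref_tok.size(1)
        ).to(seq.device)
        out, _ = self.attn(seq, seq, seq, attn_mask=mask)
        return out


# Illustrative usage: 77 prompt tokens, 256 image latents, 256 reference tokens.
block = CrossModalityPriorAlignmentSketch()
out = block(torch.randn(2, 77, 768), torch.randn(2, 256, 768), torch.randn(2, 256, 768))
print(out.shape)  # torch.Size([2, 590, 768])
```

The selective mask is expressed as a plain boolean matrix so a single attention call handles all three token groups; how AlignGen actually routes attention between the priors is described only at the level of the abstract.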
Related papers
- In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation [41.79836820271156]
"In-Context Brush" is a zero-shot framework for customized subject insertion.<n>We formulate the object image and the textual prompts as cross-modal demonstrations.<n>The goal is to inpaint the target image with the subject aligning textual prompts without model tuning.
arXiv Detail & Related papers (2025-05-26T17:49:10Z)
- FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation [21.181545626612028]
We propose FreeGraftor, a training-free framework for subject-driven image generation.
FreeGraftor employs semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated image.
Our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment.
arXiv Detail & Related papers (2025-04-22T14:55:23Z)
- HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation [36.58332467324404]
HybridBooth is a framework for personalized text-to-image diffusion models.
It generates a robust initial word embedding using a fine-tuned encoder.
It further adapts the encoder to specific subject images by optimizing key parameters.
arXiv Detail & Related papers (2024-10-10T17:58:19Z)
- Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis [7.099258248662009]
Text-to-image (T2I) models have significantly advanced the development of artificial intelligence.
However, existing T2I-based methods often struggle to accurately reproduce the appearance of individuals from a reference image.
We leverage the pre-trained UNet from Stable Diffusion to incorporate the target face image directly into the generation process.
arXiv Detail & Related papers (2024-09-27T19:31:04Z)
- Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace [52.24866347353916]
We propose an efficient method to explore the target embedding in a textual subspace.
We also propose an efficient selection strategy for determining the basis of the textual subspace.
Our method opens the door to more efficient representation learning for personalized text-to-image generation.
arXiv Detail & Related papers (2024-06-30T06:41:21Z)
- MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [5.452759083801634]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multiple subjects.
The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving textual control.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
- Tuning-Free Image Customization with Image and Text Guidance [65.9504243633169]
We introduce a tuning-free framework for simultaneous text-image-guided image customization.
Our approach preserves the semantic features of the reference image subject while allowing modification of detailed attributes based on text descriptions.
Our approach outperforms previous methods in both human and quantitative evaluations.
arXiv Detail & Related papers (2024-03-19T11:48:35Z)
- The Chosen One: Consistent Characters in Text-to-Image Diffusion Models [71.15152184631951]
We propose a fully automated solution for consistent character generation with the sole input being a text prompt.
Our method strikes a better balance between prompt alignment and identity consistency compared to the baseline methods.
arXiv Detail & Related papers (2023-11-16T18:59:51Z)
- Conditional Score Guidance for Text-Driven Image-to-Image Translation [52.73564644268749]
We present a novel algorithm for text-driven image-to-image translation based on a pretrained text-to-image diffusion model.
Our method aims to generate a target image by selectively editing the regions of interest in a source image.
arXiv Detail & Related papers (2023-05-29T10:48:34Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)