EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance
- URL: http://arxiv.org/abs/2409.08091v3
- Date: Sun, 24 Nov 2024 10:47:17 GMT
- Title: EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance
- Authors: Zicheng Duan, Yuxuan Ding, Chenhui Gou, Ziqin Zhou, Ethan Smith, Lingqiao Liu
- Abstract summary: EZIGen aims to produce images that align with both a given text prompt and subject image.
It employs two main components: a subject image encoder based on the pre-trained UNet of the Stable Diffusion model, and a process that balances text and subject guidance by separating the denoising stages in which each dominates.
It achieves state-of-the-art results on multiple personalized generation benchmarks with a unified model and 100 times less training data.
- Score: 20.430259028981094
- Abstract: Zero-shot personalized image generation models aim to produce images that align with both a given text prompt and subject image, requiring the model to effectively incorporate both sources of guidance. However, existing methods often struggle to capture fine-grained subject details and frequently prioritize one form of guidance over the other, resulting in suboptimal subject encoding and an imbalance in the generated images. In this study, we uncover key insights into achieving a high-quality balance between subject identity preservation and text-following, notably that 1) the design of the subject image encoder critically influences subject identity preservation, and 2) the text and subject guidance should take effect at different denoising stages. Building on these insights, we introduce a new approach, EZIGen, that employs two main components: a carefully crafted subject image encoder based on the pre-trained UNet of the Stable Diffusion model, followed by a process that balances the two forms of guidance by separating their dominance stages and revisiting certain time steps to bootstrap subject transfer quality. Through these two components, EZIGen achieves state-of-the-art results on multiple personalized generation benchmarks with a unified model and 100 times less training data. Demo Page: zichengduan.github.io/pages/EZIGen/index.html
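The abstract's second insight, that text and subject guidance should act at different denoising stages and that some time steps are revisited, can be illustrated with a toy schedule. This is only a sketch under stated assumptions, not the authors' implementation: the function name `decoupled_schedule`, the parameters `split` and `revisits`, and the choice to let text guidance dominate the earlier steps are all hypothetical, since the abstract does not specify the order or the step counts.

```python
from typing import List, Tuple


def decoupled_schedule(num_steps: int, split: int, revisits: int = 0) -> List[Tuple[int, str]]:
    """Return (step_index, dominant_guidance) pairs for a toy denoising run.

    Step 0 is the noisiest step. ASSUMPTION (not stated in the abstract):
    text guidance dominates steps [0, split) and subject guidance dominates
    steps [split, num_steps). `revisits` repeats the subject-dominated stage,
    a stand-in for "revisiting certain time steps".
    """
    if not 0 <= split <= num_steps:
        raise ValueError("split must lie between 0 and num_steps")
    if revisits < 0:
        raise ValueError("revisits must be non-negative")

    # Stage 1: text-dominated steps.
    schedule = [(t, "text") for t in range(split)]

    # Stage 2: subject-dominated steps, run once plus `revisits` extra passes.
    for _ in range(revisits + 1):
        schedule.extend((t, "subject") for t in range(split, num_steps))
    return schedule


example = decoupled_schedule(num_steps=5, split=2, revisits=1)
for step, guidance in example:
    print(step, guidance)
```

In a real diffusion pipeline each pair would select which conditioning signal is emphasized when calling the denoiser at that step; here the schedule is only printed to make the stage separation visible.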
Related papers
- IC-Portrait: In-Context Matching for View-Consistent Personalized Portrait [51.18967854258571]
IC-Portrait is a novel framework designed to accurately encode individual identities for personalized portrait generation.
Our key insight is that pre-trained diffusion models are fast learners for in-context dense correspondence matching.
We show that IC-Portrait consistently outperforms existing state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2025-01-28T18:59:03Z)
- Discriminative Image Generation with Diffusion Models for Zero-Shot Learning [53.44301001173801]
We present DIG-ZSL, a novel Discriminative Image Generation framework for Zero-Shot Learning.
We learn a discriminative class token (DCT) for each unseen class under the guidance of a pre-trained category discrimination model (CDM).
In this paper, the extensive experiments and visualizations on four datasets show that our DIG-ZSL: (1) generates diverse and high-quality images, (2) outperforms previous state-of-the-art nonhuman-annotated semantic prototype-based methods by a large margin, and (3) achieves comparable or better performance than baselines that leverage human-annot
arXiv Detail & Related papers (2024-12-23T02:18:54Z)
- MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance [6.4680449907623006]
This research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multi-subjects.
The proposed multi-subject cross-attention orchestrates inter-subject compositions while preserving the control of texts.
arXiv Detail & Related papers (2024-06-11T12:32:53Z)
- DreamTuner: Single Image is Enough for Subject-Driven Generation [16.982780785747202]
Diffusion-based models have demonstrated impressive capabilities for text-to-image generation.
However, existing methods based on fine-tuning fail to balance the trade-off between subject learning and the maintenance of the generation capabilities of pretrained models.
We propose DreamTuner, a novel method that injects reference information from coarse to fine to achieve subject-driven image generation more effectively.
arXiv Detail & Related papers (2023-12-21T09:37:14Z)
- Cross-Image Attention for Zero-Shot Appearance Transfer [68.43651329067393]
We introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images.
We harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint.
arXiv Detail & Related papers (2023-11-06T18:33:24Z)
- Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD).
In our experiments, we apply PhD to both subject-driven image editing tasks and explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Towards Unsupervised Deep Image Enhancement with Generative Adversarial Network [92.01145655155374]
We present an unsupervised image enhancement generative network (UEGAN).
It learns the corresponding image-to-image mapping from a set of images with desired characteristics in an unsupervised manner.
Results show that the proposed model effectively improves the aesthetic quality of images.
arXiv Detail & Related papers (2020-12-30T03:22:46Z)