Related papers: JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

URL: http://arxiv.org/abs/2407.06187v1
Date: Mon, 8 Jul 2024 17:59:02 GMT
Title: JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
Authors: Yu Zeng, Vishal M. Patel, Haochen Wang, Xun Huang, Ting-Chun Wang, Ming-Yu Liu, Yogesh Balaji
Abstract summary: Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset. We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model. Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
Score: 49.997839600988875
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Personalized text-to-image generation models enable users to create images that depict their individual possessions in diverse scenes, finding applications in various domains. To achieve the personalization capability, existing methods rely on finetuning a text-to-image foundation model on a user's custom dataset, which can be non-trivial for general users, resource-intensive, and time-consuming. Despite attempts to develop finetuning-free methods, their generation quality is much lower compared to their finetuning counterparts. In this paper, we propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model. Our key idea is to learn the joint distribution of multiple related text-image pairs that share a common subject. To facilitate learning, we propose a scalable synthetic dataset generation technique. Once trained, our model enables fast and easy personalization at test time by simply using reference images as input during the sampling process. Our approach does not require any expensive optimization process or additional modules and can faithfully preserve the identity represented by any number of reference images. Experimental results show that our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.

Related papers

Flux Already Knows -- Activating Subject-Driven Image Generation without Training [25.496237241889048]
We propose a zero-shot framework for subject-driven image generation using a vanilla Flux model. We activate strong identity-preserving capabilities without any additional data, training, or inference-time fine-tuning.
arXiv Detail & Related papers (2025-04-12T20:41:53Z)
Single Image Iterative Subject-driven Generation and Editing [40.285860652338506]
We present SISO, a training-free approach to personalizing image generation and editing from a single subject image. SISO iteratively generates images and optimizes the model based on a similarity loss with the given subject image. We demonstrate significant improvements over existing methods in image quality, subject fidelity, and background preservation.
arXiv Detail & Related papers (2025-03-20T10:45:04Z)
Multi-subject Open-set Personalization in Video Generation [110.02124633005516]
We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt. Our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-01-10T18:59:54Z)
Imagine yourself: Tuning-Free Personalized Image Generation [39.63411174712078]
We introduce Imagine yourself, a state-of-the-art model designed for personalized image generation. It operates as a tuning-free model, enabling all users to leverage a shared framework without individualized adjustments. Our study demonstrates that Imagine yourself surpasses the state-of-the-art personalization model, exhibiting superior capabilities in identity preservation, visual quality, and text alignment.
arXiv Detail & Related papers (2024-09-20T09:21:49Z)
Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning [40.06403155373455]
We propose a novel reinforcement learning framework for personalized text-to-image generation. Our proposed approach outperforms existing state-of-the-art methods by a large margin on visual fidelity while maintaining text-alignment.
arXiv Detail & Related papers (2024-07-09T08:11:53Z)
Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models [66.05234562835136]
We present MuDI, a novel framework that enables multi-subject personalization. Our main idea is to utilize segmented subjects generated by a foundation model for segmentation. Experimental results show that our MuDI can produce high-quality personalized images without identity mixing.
arXiv Detail & Related papers (2024-04-05T17:45:22Z)
Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models [67.68871360210208]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, can generate visuals with a high degree of consistency. We propose a novel fine-tuning objective, dubbed Direct Consistency Optimization, which controls the deviation between fine-tuning and pretrained models. We show that our approach achieves better prompt fidelity and subject fidelity than those post-optimized for merging regular fine-tuned models.
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
Pick-and-Draw: Training-free Semantic Guidance for Text-to-Image Personalization [56.12990759116612]
Pick-and-Draw is a training-free semantic guidance approach to boost identity consistency and generative diversity for personalization methods. The proposed approach can be applied to any personalized diffusion models and requires as few as a single reference image.
arXiv Detail & Related papers (2024-01-30T05:56:12Z)
PALP: Prompt Aligned Personalization of Text-to-Image Models [68.91005384187348]
Existing personalization methods compromise personalization ability or the alignment to complex prompts. We propose a new approach focusing on personalization methods for a single prompt to address this issue. Our method excels in improving text alignment, enabling the creation of images with complex and intricate prompts.
arXiv Detail & Related papers (2024-01-11T18:35:33Z)
Identity Encoder for Personalized Diffusion [57.1198884486401]
We propose an encoder-based approach for personalization. We learn an identity encoder which can extract an identity representation from a set of reference images of a subject. We show that our approach consistently outperforms existing fine-tuning-based approaches in both image generation and reconstruction.
arXiv Detail & Related papers (2023-04-14T23:32:24Z)
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning [20.127745565621616]
We propose InstantBooth, a novel approach built upon pre-trained text-to-image models. Our model can generate competitive results on unseen concepts concerning language-image alignment, image fidelity, and identity preservation.
arXiv Detail & Related papers (2023-04-06T23:26:38Z)
Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users. The method is based on a general framework that bypasses the lengthy optimization required by previous approaches. We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.