BootPIG: Bootstrapping Zero-shot Personalized Image Generation
Capabilities in Pretrained Diffusion Models
- URL: http://arxiv.org/abs/2401.13974v1
- Date: Thu, 25 Jan 2024 06:18:20 GMT
- Authors: Senthil Purushwalkam, Akash Gokul, Shafiq Joty, Nikhil Naik
- Abstract summary: We propose a novel architecture (BootPIG) that allows a user to provide reference images of an object in order to guide the appearance of a concept in the generated images.
The proposed BootPIG architecture makes minimal modifications to a pretrained text-to-image diffusion model.
In contrast to existing methods that require several days of pretraining, the BootPIG architecture can be trained in approximately 1 hour.
- Score: 33.6421568407629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-image generation models have demonstrated incredible success
in generating images that faithfully follow input prompts. However, the
requirement of using words to describe a desired concept provides limited
control over the appearance of the generated concepts. In this work, we address
this shortcoming by proposing an approach to enable personalization
capabilities in existing text-to-image diffusion models. We propose a novel
architecture (BootPIG) that allows a user to provide reference images of an
object in order to guide the appearance of a concept in the generated images.
The proposed BootPIG architecture makes minimal modifications to a pretrained
text-to-image diffusion model and utilizes a separate UNet model to steer the
generations toward the desired appearance. We introduce a training procedure
that allows us to bootstrap personalization capabilities in the BootPIG
architecture using data generated from pretrained text-to-image models, LLM
chat agents, and image segmentation models. In contrast to existing methods
that require several days of pretraining, the BootPIG architecture can be
trained in approximately 1 hour. Experiments on the DreamBooth dataset
demonstrate that BootPIG outperforms existing zero-shot methods while being
comparable with test-time finetuning approaches. Through a user study, we
validate the preference for BootPIG generations over existing methods both in
maintaining fidelity to the reference object's appearance and aligning with
textual prompts.
Related papers
- Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions [33.440447854396446]
We train the first open-source text-to-image model on long structured captions.
To process long captions efficiently, we propose DimFusion.
We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol.
(2025-11-10T09:25:25Z)
- More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models [53.98725993420285]
Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models.
We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model.
(2025-10-27T17:44:56Z)
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
(2024-07-08T17:59:02Z)
- BOSC: A Backdoor-based Framework for Open Set Synthetic Image Attribution [22.81354665006496]
Synthetic image attribution addresses the problem of tracing back the origin of images produced by generative models.
We propose a framework for open set attribution of synthetic images, named BOSC, that relies on the concept of backdoor attacks.
(2024-05-19T09:17:43Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
(2024-03-29T10:38:25Z)
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS).
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
(2024-02-28T06:07:07Z)
- Customize StyleGAN with One Hand Sketch [0.0]
We propose a framework to control StyleGAN imagery with a single user sketch.
We learn a conditional distribution in the latent space of a pre-trained StyleGAN model via energy-based learning.
Our model can generate multi-modal images semantically aligned with the input sketch.
(2023-10-29T09:32:33Z)
- ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models [77.03361270726944]
Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models.
We propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information.
We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout.
(2023-05-25T16:32:01Z)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
(2023-05-24T04:51:04Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
(2023-05-23T03:59:06Z)