DreamBooth: Fine Tuning Text-to-Image Diffusion Models for
Subject-Driven Generation
- URL: http://arxiv.org/abs/2208.12242v1
- Date: Thu, 25 Aug 2022 17:45:49 GMT
- Title: DreamBooth: Fine Tuning Text-to-Image Diffusion Models for
Subject-Driven Generation
- Authors: Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael
Rubinstein and Kfir Aberman
- Abstract summary: We present a new approach for "personalization" of text-to-image models.
Given a few images of a subject, we fine-tune a pretrained text-to-image model to bind a unique identifier with that subject.
The unique identifier can then be used to synthesize fully-novel photorealistic images of the subject contextualized in different scenes.
- Score: 26.748667878221568
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large text-to-image models achieved a remarkable leap in the evolution of AI,
enabling high-quality and diverse synthesis of images from a given text prompt.
However, these models lack the ability to mimic the appearance of subjects in a
given reference set and synthesize novel renditions of them in different
contexts. In this work, we present a new approach for "personalization" of
text-to-image diffusion models (specializing them to users' needs). Given as
input just a few images of a subject, we fine-tune a pretrained text-to-image
model (Imagen, although our method is not limited to a specific model) such
that it learns to bind a unique identifier with that specific subject. Once the
subject is embedded in the output domain of the model, the unique identifier
can then be used to synthesize fully-novel photorealistic images of the subject
contextualized in different scenes. By leveraging the semantic prior embedded
in the model with a new autogenous class-specific prior preservation loss, our
technique enables synthesizing the subject in diverse scenes, poses, views, and
lighting conditions that do not appear in the reference images. We apply our
technique to several previously-unassailable tasks, including subject
recontextualization, text-guided view synthesis, appearance modification, and
artistic rendering (all while preserving the subject's key features). Project
page: https://dreambooth.github.io/
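The abstract outlines two ingredients: binding a rare identifier token to the subject by fine-tuning on a handful of subject photos, and a class-specific prior preservation loss computed on images the frozen pretrained model generates for the plain class prompt. The sketch below is a minimal illustration of that combined objective under assumed interfaces; the denoiser callable, noise schedule, prompt embeddings, and prior_weight are hypothetical placeholders, not the paper's Imagen-based implementation.

```python
# Minimal sketch of a DreamBooth-style objective: a standard denoising loss on the
# few subject images (prompted with the unique identifier) plus a prior-preservation
# term on class images sampled from the frozen pretrained model. All interfaces here
# are illustrative assumptions, not the paper's actual code.
import torch
import torch.nn.functional as F

def add_noise(x0, noise, t, alphas_cumprod):
    """DDPM-style forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

def dreambooth_loss(denoiser, subject_x, subject_cond, prior_x, prior_cond,
                    alphas_cumprod, prior_weight=1.0):
    """Subject reconstruction loss plus class-specific prior-preservation loss."""
    def branch(x0, cond):
        noise = torch.randn_like(x0)
        t = torch.randint(0, alphas_cumprod.numel(), (x0.shape[0],))
        x_t = add_noise(x0, noise, t, alphas_cumprod)
        return F.mse_loss(denoiser(x_t, t, cond), noise)  # model predicts the added noise

    # Subject branch: a few real photos, conditioned on "a [identifier] <class>".
    # Prior branch: class images generated by the frozen pretrained model,
    # conditioned on "a <class>", which discourages drift of the class prior.
    return branch(subject_x, subject_cond) + prior_weight * branch(prior_x, prior_cond)

if __name__ == "__main__":
    # Dummy tensors and a stand-in denoiser, only to show how the pieces fit together.
    denoiser = lambda x_t, t, cond: torch.zeros_like(x_t)
    alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
    subject_x = torch.randn(2, 3, 64, 64)     # a few photos of the subject
    prior_x = torch.randn(2, 3, 64, 64)       # generated class images
    cond = torch.zeros(2, 77, 768)            # placeholder text embeddings
    print(dreambooth_loss(denoiser, subject_x, cond, prior_x, cond, alphas_cumprod))
```

In prompts, the subject branch would read something like "a [identifier] <class>" while the prior branch uses the generic "a <class>", so the fine-tuned model learns the specific instance without forgetting its general notion of the class.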
Related papers
- Imagine yourself: Tuning-Free Personalized Image Generation [39.63411174712078]
We introduce Imagine yourself, a state-of-the-art model designed for personalized image generation.
It operates as a tuning-free model, enabling all users to leverage a shared framework without individualized adjustments.
Our study demonstrates that Imagine yourself surpasses the state-of-the-art personalization model, exhibiting superior capabilities in identity preservation, visual quality, and text alignment.
arXiv Detail & Related papers (2024-09-20T09:21:49Z)
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Text-to-image models struggle to consistently portray the same subject across diverse prompts.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
- Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD).
In our experiments, we apply PhD to subject-driven image editing tasks and also explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z)
- ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models [77.03361270726944]
Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models.
We propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low to high frequency information.
We apply ProSpect in various personalized attribute-aware image generation applications, such as image-guided or text-driven manipulations of materials, style, and layout.
arXiv Detail & Related papers (2023-05-25T16:32:01Z)
- Subject-driven Text-to-Image Generation via Apprenticeship Learning [83.88256453081607]
We present SuTI, a subject-driven Text-to-Image generator that replaces subject-specific fine tuning with in-context learning.
SuTI is powered by apprenticeship learning, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models.
We show that SuTI significantly outperforms existing models like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, Re-Imagen and DreamBooth.
arXiv Detail & Related papers (2023-04-01T00:47:35Z)
- Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models [43.32978092618245]
We present a novel neural pipeline for generating a coherent storybook from the plain text of a story.
We leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images.
arXiv Detail & Related papers (2023-02-08T06:24:06Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.