Subject-driven Text-to-Image Generation via Apprenticeship Learning
- URL: http://arxiv.org/abs/2304.00186v5
- Date: Mon, 2 Oct 2023 08:08:45 GMT
- Title: Subject-driven Text-to-Image Generation via Apprenticeship Learning
- Authors: Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei
Chang, William W. Cohen
- Abstract summary: We present SuTI, a subject-driven Text-to-Image generator that replaces subject-specific fine-tuning with in-context learning.
SuTI is powered by apprenticeship learning, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models.
We show that SuTI significantly outperforms existing models like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, Re-Imagen and DreamBooth.
- Score: 83.88256453081607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-image generation models like DreamBooth have made remarkable
progress in generating highly customized images of a target subject, by
fine-tuning an "expert model" for a given subject from a few examples.
However, this process is expensive, since a new expert model must be learned
for each subject. In this paper, we present SuTI, a Subject-driven
Text-to-Image generator that replaces subject-specific fine-tuning with
in-context learning. Given a few demonstrations of a new subject, SuTI can
instantly generate novel renditions of the subject in different scenes, without
any subject-specific optimization. SuTI is powered by apprenticeship learning,
where a single apprentice model is learned from data generated by a massive
number of subject-specific expert models. Specifically, we mine millions of
image clusters from the Internet, each centered around a specific visual
subject. We adopt these clusters to train a massive number of expert models,
each specializing in a different subject. The apprentice model SuTI then learns
to imitate the behavior of these fine-tuned experts. SuTI can generate
high-quality and customized subject-specific images 20x faster than
optimization-based SoTA methods. On the challenging DreamBench and
DreamBench-v2, our human evaluation shows that SuTI significantly outperforms
existing models like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt,
Re-Imagen and DreamBooth, especially on the subject and text alignment aspects.
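
To make the two-stage recipe in the abstract concrete, here is a minimal, runnable Python sketch of the apprenticeship-learning data pipeline. Every name below is a hypothetical stand-in (the experts and the apprentice are abstracted as toy objects); none of it is the authors' code.

```python
# Illustrative mock-up of the apprenticeship pipeline described above.
# All names are hypothetical stand-ins, not the authors' code.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubjectCluster:
    subject_id: str            # one mined visual subject
    demo_images: List[str]     # a few images of that subject


def finetune_expert(cluster: SubjectCluster) -> Callable[[str], str]:
    """Stage 1 stub: in the paper this is subject-specific fine-tuning of a
    text-to-image model; here the 'expert' is a toy renderer."""
    def expert(prompt: str) -> str:
        return f"image_of({cluster.subject_id} | {prompt})"
    return expert


def build_apprentice_data(clusters: List[SubjectCluster], prompts: List[str]):
    """Stage 2: turn expert outputs into (demonstrations, prompt, target)
    triples that teach a single apprentice in-context personalization."""
    data = []
    for cluster in clusters:
        expert = finetune_expert(cluster)          # one expert per subject
        for prompt in prompts:
            data.append({
                "context_images": cluster.demo_images,  # shown in-context
                "prompt": prompt,
                "target": expert(prompt),               # expert's rendition
            })
    return data


if __name__ == "__main__":
    clusters = [
        SubjectCluster("corgi_042", ["corgi_a.jpg", "corgi_b.jpg"]),
        SubjectCluster("teapot_007", ["teapot_a.jpg"]),
    ]
    prompts = ["the subject on a beach", "the subject in watercolor style"]
    dataset = build_apprentice_data(clusters, prompts)
    # The apprentice is then trained to map (context_images, prompt) -> target,
    # so at inference it personalizes from a few demonstrations with no
    # per-subject optimization.
    print(len(dataset), "apprentice training examples")
```

At inference time the demonstrations are given directly in the apprentice's context, which is why no per-subject optimization is needed.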
Related papers
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming prior finetuning-based and finetuning-free personalization baselines.
arXiv Detail & Related papers (2024-07-08T17:59:02Z)
- Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models [66.05234562835136]
We present MuDI, a novel framework that enables multi-subject personalization.
Our main idea is to utilize subjects segmented by a foundation segmentation model.
Experimental results show that our MuDI can produce high-quality personalized images without identity mixing (a toy sketch of the segmentation idea follows below).
arXiv Detail & Related papers (2024-04-05T17:45:22Z)
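
A toy sketch of the segmentation idea in the MuDI entry above: cut each subject out of its reference photo with a foundation segmentation model, then compose the cut-outs so identities stay spatially separated. The `fake_segment` stand-in (a brightness threshold) replaces a real segmenter here, and the composition step is likewise illustrative, not MuDI's actual pipeline.

```python
# Toy illustration: segment subjects, then paste them onto one canvas so
# no training example blurs one identity into another.
import numpy as np


def fake_segment(image: np.ndarray) -> np.ndarray:
    """Stand-in for a foundation segmenter: returns a boolean foreground
    mask. Here we simply mask pixels brighter than the image mean."""
    return image.mean(axis=-1) > image.mean()


def compose_subjects(subject_images, canvas_size=(256, 256, 3)):
    """Paste segmented subjects side by side onto a blank canvas, keeping
    each identity in its own spatial slot."""
    canvas = np.zeros(canvas_size, dtype=np.uint8)
    slot_w = canvas_size[1] // max(len(subject_images), 1)
    for i, img in enumerate(subject_images):
        h, w = img.shape[:2]
        mask = fake_segment(img)
        region = canvas[:h, i * slot_w:i * slot_w + w]
        region[mask] = img[mask]   # copy only the segmented foreground
    return canvas


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dog = rng.integers(0, 255, (128, 120, 3), dtype=np.uint8)
    cat = rng.integers(0, 255, (128, 120, 3), dtype=np.uint8)
    training_image = compose_subjects([dog, cat])
    print(training_image.shape)  # (256, 256, 3)
```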
- SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground-truth data (a sketch of the expert-merging step follows below).
arXiv Detail & Related papers (2024-03-11T17:35:33Z)
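
A minimal sketch of the "merge skill-specific experts" step in the SELMA entry above: fine-tune one expert per skill (stubbed out here), then average their weights into a single model. The real method merges LoRA experts learned on auto-generated data; the plain uniform weight averaging below is an assumption used only to illustrate the merging step.

```python
# Toy expert-merging sketch: uniformly average matching parameters.
import torch
import torch.nn as nn


def make_expert() -> nn.Module:
    """Stand-in for a skill-specific expert (e.g. a LoRA-adapted T2I model)."""
    return nn.Linear(16, 16)


def merge_experts(experts):
    """Average each named parameter across all experts into a merged model."""
    merged = make_expert()
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack(
                [dict(e.named_parameters())[name] for e in experts]
            )
            param.copy_(stacked.mean(dim=0))
    return merged


if __name__ == "__main__":
    skill_experts = [make_expert() for _ in ("counting", "spatial", "text")]
    merged_model = merge_experts(skill_experts)
    print(sum(p.numel() for p in merged_model.parameters()), "merged parameters")
```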
- InstructBooth: Instruction-following Personalized Text-to-Image Generation [30.89054609185801]
InstructBooth is a novel method designed to enhance image-text alignment in personalized text-to-image models.
Our approach first personalizes text-to-image models with a small number of subject-specific images using a unique identifier.
After personalization, we fine-tune the personalized text-to-image models using reinforcement learning to maximize a reward that quantifies image-text alignment (a toy policy-gradient sketch follows below).
arXiv Detail & Related papers (2023-12-04T20:34:46Z)
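
The reward-maximization step in the InstructBooth entry above can be illustrated with a toy REINFORCE loop. The Gaussian "policy" and cosine-similarity reward below are stand-ins chosen for self-containedness (the real method fine-tunes a diffusion model and scores actual generated images); only the policy-gradient shape of the update is the point.

```python
# Toy REINFORCE sketch: push a stand-in generator toward higher
# image-text alignment reward.
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Linear(8, 8)           # stand-in for the personalized T2I model
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
text_embedding = torch.randn(8)    # stand-in for an encoded prompt


def alignment_reward(image_embedding: torch.Tensor) -> torch.Tensor:
    """Stand-in reward: cosine similarity between 'image' and prompt
    embeddings (a real system might use a CLIP- or preference-based scorer)."""
    return torch.cosine_similarity(image_embedding, text_embedding, dim=-1)


for step in range(200):
    noise = torch.randn(8)
    mean = policy(noise)                       # policy proposes a sample mean
    dist = torch.distributions.Normal(mean, 1.0)
    sample = dist.sample()                     # stochastic "generation"
    reward = alignment_reward(sample)
    log_prob = dist.log_prob(sample).sum()
    loss = -(reward.detach() * log_prob)       # REINFORCE: raise reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final reward:", alignment_reward(policy(torch.randn(8))).item())
```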
- ITI-GEN: Inclusive Text-to-Image Generation [56.72212367905351]
This study investigates inclusive text-to-image generative models that generate images based on human-written prompts.
We show that, for some attributes, images can represent concepts more expressively than text.
We propose a novel approach, ITI-GEN, that leverages readily available reference images for Inclusive Text-to-Image GENeration.
arXiv Detail & Related papers (2023-09-11T15:54:30Z)
- Diffusion idea exploration for art generation [0.10152838128195467]
Diffusion models have recently outperformed other generative models on image generation tasks that use cross-modal data as guiding information.
Initial experiments on this novel image-generation task demonstrated promising qualitative results.
arXiv Detail & Related papers (2023-07-11T02:35:26Z)
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation (a toy sketch of this conditioning idea follows below).
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
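
A toy sketch of the subject-representation idea in the BLIP-Diffusion entry above: a multimodal encoder turns (subject image, subject text) features into an embedding token that is concatenated with the ordinary prompt embedding, so the generator can be conditioned on the subject without fine-tuning. The shapes and modules below are illustrative assumptions, not the paper's architecture.

```python
# Toy subject-conditioning sketch: fuse image + text features into one
# extra token prepended to the prompt embedding sequence.
import torch
import torch.nn as nn

dim = 32


class MultimodalSubjectEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, image_feat, text_feat):
        # Fuse image and text views of the subject into one embedding token.
        return self.fuse(torch.cat([image_feat, text_feat], dim=-1))


encoder = MultimodalSubjectEncoder()
image_feat = torch.randn(1, dim)       # features of the subject photo
class_feat = torch.randn(1, dim)       # embedding of e.g. the word "dog"
prompt_emb = torch.randn(1, 12, dim)   # ordinary prompt token embeddings

subject_token = encoder(image_feat, class_feat).unsqueeze(1)   # (1, 1, dim)
conditioning = torch.cat([subject_token, prompt_emb], dim=1)   # (1, 13, dim)
print(conditioning.shape)
```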
- Implementing and Experimenting with Diffusion Models for Text-to-Image Generation [0.0]
Two models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images could be generated from a simple textual description of an image.
Text-to-image models require exceptionally large amounts of computational resources to train and must handle huge datasets collected from the internet.
This thesis contributes by reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model.
arXiv Detail & Related papers (2022-09-22T12:03:33Z)
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation [26.748667878221568]
We present a new approach for "personalization" of text-to-image models.
Given a few images of a subject, we fine-tune a pretrained text-to-image model to bind a unique identifier with that specific subject.
The unique identifier can then be used to synthesize fully novel photorealistic images of the subject contextualized in different scenes (a brief sketch of the identifier-binding recipe follows below).
arXiv Detail & Related papers (2022-08-25T17:45:49Z)
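
A brief sketch of the DreamBooth recipe summarized above: bind a rare identifier token to the subject by fine-tuning on identifier prompts, while a class-prior term keeps the model from forgetting what generic class members look like. The identifier "sks" is a popular community choice rather than necessarily the paper's token, and `diffusion_loss` is a stub; real training denoises latents with a diffusion model (e.g. via a DreamBooth training script).

```python
# Schematic DreamBooth objective: subject loss + prior-preservation loss.
IDENTIFIER = "sks"          # rare token with little prior meaning (assumption)
SUBJECT_CLASS = "dog"
PRIOR_WEIGHT = 1.0          # lambda in the prior-preservation loss


def subject_prompt(context: str) -> str:
    """Prompt that ties the identifier to the subject in a new scene."""
    return f"a photo of {IDENTIFIER} {SUBJECT_CLASS} {context}"


def prior_prompt() -> str:
    """Class-only prompt used for prior-preservation examples."""
    return f"a photo of a {SUBJECT_CLASS}"


def diffusion_loss(prompt: str, images) -> float:
    """Stub for the usual denoising loss on (prompt, image) pairs."""
    return 0.0  # placeholder


def dreambooth_step(subject_images, class_images) -> float:
    # Fit the few subject photos under the identifier prompt, plus a
    # prior-preservation term on generated class images.
    return (diffusion_loss(subject_prompt("in a garden"), subject_images)
            + PRIOR_WEIGHT * diffusion_loss(prior_prompt(), class_images))


print(subject_prompt("on the moon"))  # -> "a photo of sks dog on the moon"
```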