DreamTuner: Single Image is Enough for Subject-Driven Generation
- URL: http://arxiv.org/abs/2312.13691v1
- Date: Thu, 21 Dec 2023 09:37:14 GMT
- Title: DreamTuner: Single Image is Enough for Subject-Driven Generation
- Authors: Miao Hua, Jiawei Liu, Fei Ding, Wei Liu, Jie Wu and Qian He
- Abstract summary: Diffusion-based models have demonstrated impressive capabilities for text-to-image generation.
However, existing methods based on fine-tuning fail to balance the trade-off between subject learning and the maintenance of the generation capabilities of pretrained models.
We propose DreamTuner, a novel method that injects reference information from coarse to fine to achieve subject-driven image generation more effectively.
- Score: 16.982780785747202
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based models have demonstrated impressive capabilities for
text-to-image generation and are expected for personalized applications of
subject-driven generation, which require the generation of customized concepts
with one or a few reference images. However, existing methods based on
fine-tuning fail to balance the trade-off between subject learning and the
maintenance of the generation capabilities of pretrained models. Moreover,
other methods that utilize additional image encoders tend to lose important
details of the subject due to encoding compression. To address these
challenges, we propose DreamTuner, a novel method that injects reference
information from coarse to fine to achieve subject-driven image generation more
effectively. DreamTuner introduces a subject-encoder for coarse subject
identity preservation, where the compressed general subject features are
introduced through an attention layer before visual-text cross-attention. We
then modify the self-attention layers within pretrained text-to-image models to
self-subject-attention layers to refine the details of the target subject. The
generated image queries detailed features from both the reference image and
itself in self-subject-attention. It is worth emphasizing that
self-subject-attention is an effective, elegant, and training-free method for
maintaining the detailed features of customized subjects and can serve as a
plug-and-play solution during inference. Finally, with additional
subject-driven fine-tuning, DreamTuner achieves remarkable performance in
subject-driven image generation, which can be controlled by a text or other
conditions such as pose. For further details, please visit the project page at
https://dreamtuner-diffusion.github.io/.
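
The self-subject-attention step described in the abstract, where the generated image queries features from both the reference image and itself, can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names are hypothetical, the query/key/value projections are replaced by the identity, and the reference features are simply appended to the key/value set, which is one plausible reading of the abstract.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def self_subject_attention(gen_feats, ref_feats):
    """Assumed form: generated-image tokens attend to themselves AND to
    reference-image tokens by concatenating both into the key/value set.
    Identity projections are used here for brevity (an assumption)."""
    keys_values = gen_feats + ref_feats
    return attention(gen_feats, keys_values, keys_values)


# Toy features: two generated-image tokens, one reference-image token.
gen = [[1.0, 0.0], [0.0, 1.0]]
ref = [[10.0, 0.0]]
out = self_subject_attention(gen, ref)
```

With an empty reference list the function reduces to ordinary self-attention, which matches the abstract's description of the mechanism as a plug-and-play modification applied at inference time.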
Related papers
- AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation [14.68987039472664]
We propose AnyStory, a unified approach for personalized subject generation.
AnyStory achieves high-fidelity personalization not only for single subjects but also for multiple subjects, without sacrificing subject fidelity.
arXiv Detail & Related papers (2025-01-16T12:28:39Z) - Nested Attention: Semantic-aware Attention Values for Concept Personalization [78.90196530697897]
We introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers.
Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image.
arXiv Detail & Related papers (2025-01-02T18:52:11Z) - Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores conditioning diffusion models on an additional image that provides visual guidance about the particular subjects to generate.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins achieve results superior to existing methods on all tasks, with each plugin containing only 28.55M trainable parameters.
arXiv Detail & Related papers (2024-11-22T21:38:51Z) - EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance [20.430259028981094]
EZIGen aims to produce images that align with both a given text prompt and subject image.
One of its two main components is a carefully crafted subject image encoder based on the pre-trained UNet of the Stable Diffusion model.
It achieves state-of-the-art results on multiple personalized generation benchmarks with a unified model and 100 times less training data.
arXiv Detail & Related papers (2024-09-12T14:44:45Z) - Decoupled Textual Embeddings for Customized Image Generation [62.98933630971543]
Customized text-to-image generation aims to learn user-specified concepts with a few images.
Existing methods usually suffer from overfitting issues and entangle the subject-unrelated information with the learned concept.
We propose DETEX, a novel approach that learns a disentangled concept embedding for flexible customized text-to-image generation.
arXiv Detail & Related papers (2023-12-19T03:32:10Z) - Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model [22.975965453227477]
We introduce a new framework called Paste, Inpaint and Harmonize via Denoising (PhD).
In our experiments, we apply PhD to subject-driven image editing tasks and explore text-driven scene generation given a reference subject.
arXiv Detail & Related papers (2023-06-13T07:43:10Z) - BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z) - DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation [50.39533637201273]
We propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation.
By combining the identity-preserved embedding and identity-irrelevant embedding, DisenBooth demonstrates more generation flexibility and controllability.
arXiv Detail & Related papers (2023-05-05T09:08:25Z) - DreamArtist++: Controllable One-Shot Text-to-Image Generation via Positive-Negative Adapter [63.622879199281705]
Some example-based image generation approaches have been proposed, i.e., generating new concepts by absorbing the salient features of a few input references.
We propose a simple yet effective framework, namely DreamArtist, which adopts a novel positive-negative prompt-tuning learning strategy on the pre-trained diffusion model.
We have conducted extensive experiments and evaluated the proposed method in terms of image similarity (fidelity) and diversity, generation controllability, and style cloning.
arXiv Detail & Related papers (2022-11-21T10:37:56Z) - DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation [26.748667878221568]
We present a new approach for "personalization" of text-to-image models.
We fine-tune a pretrained text-to-image model to bind a unique identifier with that specific subject.
The unique identifier can then be used to synthesize fully novel photorealistic images of the subject contextualized in different scenes.
arXiv Detail & Related papers (2022-08-25T17:45:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.