Taming Encoder for Zero Fine-tuning Image Customization with
Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2304.02642v1
- Date: Wed, 5 Apr 2023 17:59:32 GMT
- Title: Taming Encoder for Zero Fine-tuning Image Customization with
Text-to-Image Diffusion Models
- Authors: Xuhui Jia, Yang Zhao, Kelvin C.K. Chan, Yandong Li, Han Zhang, Boqing
Gong, Tingbo Hou, Huisheng Wang, Yu-Chuan Su
- Abstract summary: This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
- Score: 55.04969603431266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a method for generating images of customized objects
specified by users. The method is based on a general framework that bypasses
the lengthy optimization required by previous approaches, which often employ a
per-object optimization paradigm. Our framework adopts an encoder to capture
high-level identifiable semantics of objects, producing an object-specific
embedding with only a single feed-forward pass. The acquired object embedding
is then passed to a text-to-image synthesis model for subsequent generation. To
effectively blend an object-aware embedding space into a well-developed
text-to-image model under the same generation context, we investigate different
network designs and training strategies, and propose a simple yet effective
regularized joint training scheme with an object identity preservation loss.
Additionally, we propose a caption generation scheme that becomes a critical
piece in ensuring the object-specific embedding is faithfully reflected in the
generation process while preserving control and editing abilities. Once trained,
the network is able to produce diverse content and styles, conditioned on both
texts and objects. We demonstrate through experiments that our proposed method
is able to synthesize images with compelling output quality, appearance
diversity, and object fidelity, without the need for test-time optimization.
Systematic studies are also conducted to analyze our models, providing insights
for future work.
Related papers
- Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning [40.06403155373455]
We propose a novel reinforcement learning framework for personalized text-to-image generation.
Our proposed approach outperforms existing state-of-the-art methods by a large margin in visual fidelity while maintaining text alignment.
arXiv Detail & Related papers (2024-07-09T08:11:53Z)
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
arXiv Detail & Related papers (2024-07-08T17:59:02Z)
- ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes [64.57705752579207]
We evaluate the resilience of vision-based models against diverse object-to-background context variations.
We harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate object-to-background changes.
arXiv Detail & Related papers (2024-03-07T17:48:48Z)
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Consistently portraying the same subject across diverse prompts remains a challenge for text-to-image models.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
- Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding [7.893308498886083]
Our proposed method aims to address the challenges of generalizability and fidelity in an object-driven way.
A prototypical embedding is initialized based on the object's appearance and its class before fine-tuning the diffusion model.
Our method outperforms several existing works.
arXiv Detail & Related papers (2024-01-28T17:11:42Z)
- CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models [85.69959024572363]
CustomNet is a novel object customization approach that explicitly incorporates 3D novel view synthesis capabilities into the object customization process.
We introduce delicate designs to enable location control and flexible background control through textual descriptions or specific user-defined images.
Our method facilitates zero-shot object customization without test-time optimization, offering simultaneous control over the viewpoints, location, and background.
arXiv Detail & Related papers (2023-10-30T17:50:14Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- Generating Annotated High-Fidelity Images Containing Multiple Coherent Objects [10.783993190686132]
We propose a multi-object generation framework that can synthesize images with multiple objects without explicitly requiring contextual information.
We demonstrate how coherency and fidelity are preserved with our method through experiments on the Multi-MNIST and CLEVR datasets.
arXiv Detail & Related papers (2020-06-22T11:33:55Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)