Taming Encoder for Zero Fine-tuning Image Customization with
Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2304.02642v1
- Date: Wed, 5 Apr 2023 17:59:32 GMT
- Title: Taming Encoder for Zero Fine-tuning Image Customization with
Text-to-Image Diffusion Models
- Authors: Xuhui Jia, Yang Zhao, Kelvin C.K. Chan, Yandong Li, Han Zhang, Boqing
Gong, Tingbo Hou, Huisheng Wang, Yu-Chuan Su
- Abstract summary: This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
- Score: 55.04969603431266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a method for generating images of customized objects
specified by users. The method is based on a general framework that bypasses
the lengthy optimization required by previous approaches, which often employ a
per-object optimization paradigm. Our framework adopts an encoder to capture
high-level identifiable semantics of objects, producing an object-specific
embedding with only a single feed-forward pass. The acquired object embedding
is then passed to a text-to-image synthesis model for subsequent generation. To
effectively blend an object-aware embedding space into a well-developed
text-to-image model under the same generation context, we investigate different
network designs and training strategies, and propose a simple yet effective
regularized joint training scheme with an object identity preservation loss.
Additionally, we propose a caption generation scheme that becomes a critical
piece in ensuring the object-specific embedding is faithfully reflected in the
generation process while preserving control and editing abilities. Once trained,
the network is able to produce diverse content and styles, conditioned on both
texts and objects. We demonstrate through experiments that our proposed method
is able to synthesize images with compelling output quality, appearance
diversity, and object fidelity, without the need for test-time optimization.
Systematic studies are also conducted to analyze our models, providing insights
for future work.
Related papers
- Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning [40.06403155373455]
We propose a novel reinforcement learning framework for personalized text-to-image generation.
Our proposed approach outperforms existing state-of-the-art methods by a large margin in visual fidelity while maintaining text alignment.
arXiv Detail & Related papers (2024-07-09T08:11:53Z)
- JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation [49.997839600988875]
Existing personalization methods rely on finetuning a text-to-image foundation model on a user's custom dataset.
We propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model.
Our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
arXiv Detail & Related papers (2024-07-08T17:59:02Z)
- ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes [64.57705752579207]
We evaluate the resilience of vision-based models against diverse object-to-background context variations.
We harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate object-to-background changes.
arXiv Detail & Related papers (2024-03-07T17:48:48Z)
- Training-Free Consistent Text-to-Image Generation [80.4814768762066]
Consistently portraying the same subject across diverse prompts remains a challenge for text-to-image models.
Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects.
We present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model.
arXiv Detail & Related papers (2024-02-05T18:42:34Z)
- Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding [7.893308498886083]
Our proposed method aims to address the challenges of generalizability and fidelity in an object-driven way.
A prototypical embedding is initialized based on the object's appearance and its class before fine-tuning the diffusion model.
Our method outperforms several existing works.
arXiv Detail & Related papers (2024-01-28T17:11:42Z)
- CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models [85.69959024572363]
CustomNet is a novel object customization approach that explicitly incorporates 3D novel view synthesis capabilities into the object customization process.
We introduce delicate designs to enable location control and flexible background control through textual descriptions or specific user-defined images.
Our method facilitates zero-shot object customization without test-time optimization, offering simultaneous control over the viewpoints, location, and background.
arXiv Detail & Related papers (2023-10-30T17:50:14Z)
- Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z)
- Generating Annotated High-Fidelity Images Containing Multiple Coherent Objects [10.783993190686132]
We propose a multi-object generation framework that can synthesize images with multiple objects without explicitly requiring contextual information.
We demonstrate how coherency and fidelity are preserved with our method through experiments on the Multi-MNIST and CLEVR datasets.
arXiv Detail & Related papers (2020-06-22T11:33:55Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)