TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition
- URL: http://arxiv.org/abs/2307.12493v4
- Date: Tue, 10 Oct 2023 04:23:11 GMT
- Title: TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition
- Authors: Shilin Lu, Yanzhu Liu, Adams Wai-Kin Kong
- Abstract summary: TF-ICON is a framework that harnesses the power of text-driven diffusion models for cross-domain image-guided composition.
TF-ICON can leverage off-the-shelf diffusion models to perform cross-domain image-guided composition without requiring additional training, finetuning, or optimization.
Our experiments show that equipping Stable Diffusion with the exceptional prompt outperforms state-of-the-art inversion methods on various datasets.
- Score: 13.087647740473205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-driven diffusion models have exhibited impressive generative
capabilities, enabling various image editing tasks. In this paper, we propose
TF-ICON, a novel Training-Free Image COmpositioN framework that harnesses the
power of text-driven diffusion models for cross-domain image-guided
composition. This task aims to seamlessly integrate user-provided objects into
a specific visual context. Current diffusion-based methods often involve costly
instance-based optimization or finetuning of pretrained models on customized
datasets, which can potentially undermine their rich prior. In contrast,
TF-ICON can leverage off-the-shelf diffusion models to perform cross-domain
image-guided composition without requiring additional training, finetuning, or
optimization. Moreover, we introduce the exceptional prompt, which contains no
information, to facilitate text-driven diffusion models in accurately inverting
real images into latent representations, forming the basis for compositing. Our
experiments show that equipping Stable Diffusion with the exceptional prompt
outperforms state-of-the-art inversion methods on various datasets (CelebA-HQ,
COCO, and ImageNet), and that TF-ICON surpasses prior baselines in versatile
visual domains. Code is available at https://github.com/Shilin-LU/TF-ICON
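
Below is a minimal sketch of the inversion idea the abstract describes: using an uninformative prompt so that a text-driven diffusion model (here, Stable Diffusion via Hugging Face diffusers) can deterministically invert a real image into a noise latent that later serves as the basis for compositing. The empty prompt only approximates the paper's "exceptional prompt", which is a specific information-free construction (see the official repo); the model id, resolution, and step count are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch only: TF-ICON's exceptional prompt is approximated here with
# an empty prompt. Model id, resolution, and step count are assumptions.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, DDIMScheduler

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)


@torch.no_grad()
def ddim_invert(image: Image.Image, num_steps: int = 50) -> torch.Tensor:
    """Invert a real image into a noise latent x_T via deterministic DDIM steps."""
    # 1. Encode the image into the VAE latent space.
    img = image.convert("RGB").resize((512, 512))
    x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0        # scale to [-1, 1]
    x = x.permute(2, 0, 1).unsqueeze(0).to(device, dtype=pipe.vae.dtype)
    latents = pipe.vae.encode(x).latent_dist.mean * pipe.vae.config.scaling_factor

    # 2. Uninformative conditioning (empty string stands in for the exceptional prompt).
    tokens = pipe.tokenizer(
        "", padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    )
    text_emb = pipe.text_encoder(tokens.input_ids.to(device))[0]

    # 3. Walk the DDIM schedule forward in time (t: 0 -> T), reusing the UNet's
    #    noise prediction to take the deterministic inverse update at each step.
    pipe.scheduler.set_timesteps(num_steps, device=device)
    timesteps = list(reversed(pipe.scheduler.timesteps))
    alphas = pipe.scheduler.alphas_cumprod.to(device=device, dtype=latents.dtype)
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        eps = pipe.unet(latents, t, encoder_hidden_states=text_emb).sample
        a_t, a_next = alphas[t], alphas[t_next]
        x0 = (latents - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
        latents = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps  # DDIM inversion step
    return latents  # starting noise latent for reconstruction / composition
```

Denoising this latent with the same uninformative conditioning should approximately reconstruct the input image; TF-ICON builds its training-free composition on top of such inverted latents.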
Related papers
- TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization [59.412236435627094]
TALE is a training-free framework harnessing the generative capabilities of text-to-image diffusion models.
We equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization.
Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition.
arXiv Detail & Related papers (2024-08-07T08:52:21Z) - Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification [49.41632476658246]
We discuss the extension of DFKD to Vision-Language Foundation Models without access to the billion-level image-text datasets.
The objective is to customize a student model for distribution-agnostic downstream tasks with given category concepts.
We propose three novel Prompt Diversification methods to encourage image synthesis with diverse styles.
arXiv Detail & Related papers (2024-07-21T13:26:30Z) - InsertDiffusion: Identity Preserving Visualization of Objects through a Training-Free Diffusion Architecture [0.0]
InsertDiffusion is a training-free diffusion architecture that efficiently embeds objects into images.
Our approach utilizes off-the-shelf generative models and eliminates the need for fine-tuning.
By decomposing the generation task into independent steps, InsertDiffusion offers a scalable solution.
arXiv Detail & Related papers (2024-07-15T10:15:58Z) - FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior [50.0535198082903]
We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image.
We showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition.
arXiv Detail & Related papers (2024-07-06T03:35:43Z) - DiffPop: Plausibility-Guided Object Placement Diffusion for Image Composition [13.341996441742374]
DiffPop is a framework that learns plausible scale and spatial relations between multiple objects and the corresponding scene image.
We develop a human-in-the-loop pipeline which exploits human labeling on the diffusion-generated composite images.
Our dataset and code will be released.
arXiv Detail & Related papers (2024-06-12T03:40:17Z) - YaART: Yet Another ART Rendering Technology [119.09155882164573]
This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences.
We analyze how these choices affect both the efficiency of the training process and the quality of the generated images.
We demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets.
arXiv Detail & Related papers (2024-04-08T16:51:19Z) - ControlCom: Controllable Image Composition using Diffusion Model [45.48263800282992]
We propose a controllable image composition method that unifies four tasks in one diffusion model.
We also propose a local enhancement module to enhance the foreground details in the diffusion model.
The proposed method is evaluated on both public benchmark and real-world data.
arXiv Detail & Related papers (2023-08-19T14:56:44Z) - DiffDis: Empowering Generative Diffusion Model with Cross-Modal
Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)