Related papers: FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior

URL: http://arxiv.org/abs/2407.04947v1
Date: Sat, 6 Jul 2024 03:35:43 GMT
Title: FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior
Authors: Zhekai Chen, Wen Wang, Zhen Yang, Zeqing Yuan, Hao Chen, Chunhua Shen,
Abstract summary: We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image. We showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition.
Score: 50.0535198082903
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We offer a novel approach to image composition, which integrates multiple input images into a single, coherent image. Rather than concentrating on specific use cases such as appearance editing (image harmonization) or semantic editing (semantic image composition), we showcase the potential of utilizing the powerful generative prior inherent in large-scale pre-trained diffusion models to accomplish generic image composition applicable to both scenarios. We observe that the pre-trained diffusion models automatically identify simple copy-paste boundary areas as low-density regions during denoising. Building on this insight, we propose to optimize the composed image towards high-density regions guided by the diffusion prior. In addition, we introduce a novel maskguided loss to further enable flexible semantic image composition. Extensive experiments validate the superiority of our approach in achieving generic zero-shot image composition. Additionally, our approach shows promising potential in various tasks, such as object removal and multiconcept customization.

Related papers

Dataset Augmentation by Mixing Visual Concepts [3.5420134832331334]
This paper proposes a dataset augmentation method by fine-tuning pre-trained diffusion models. We adapt the diffusion model by conditioning it with real images and novel text embeddings. Our approach outperforms state-of-the-art augmentation techniques on benchmark classification tasks.
arXiv Detail & Related papers (2024-12-19T19:42:22Z)
Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas [33.334956022229846]
We propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Our method maintains compatibility with the input prompt and visual quality of the generated images while increasing their semantic coherence.
arXiv Detail & Related papers (2024-08-28T09:22:32Z)
TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization [59.412236435627094]
TALE is a training-free framework harnessing the generative capabilities of text-to-image diffusion models. We equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization. Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition.
arXiv Detail & Related papers (2024-08-07T08:52:21Z)
DiffPop: Plausibility-Guided Object Placement Diffusion for Image Composition [13.341996441742374]
DiffPop is a framework that learns the scale and spatial relations among multiple objects and the corresponding scene image. We develop a human-in-the-loop pipeline which exploits human labeling on the diffusion-generated composite images. Our dataset and code will be released.
arXiv Detail & Related papers (2024-06-12T03:40:17Z)
Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS) A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z)
Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis [62.07413805483241]
Steered Diffusion is a framework for zero-shot conditional image generation using a diffusion model trained for unconditional generation. We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution.
arXiv Detail & Related papers (2023-09-30T02:03:22Z)
ControlCom: Controllable Image Composition using Diffusion Model [45.48263800282992]
We propose a controllable image composition method that unifies four tasks in one diffusion model. We also propose a local enhancement module to enhance the foreground details in the diffusion model. The proposed method is evaluated on both public benchmark and real-world data.
arXiv Detail & Related papers (2023-08-19T14:56:44Z)
TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition [13.087647740473205]
TF-ICON is a framework that harnesses the power of text-driven diffusion models for cross-domain image-guided composition. TF-ICON can leverage off-the-shelf diffusion models to perform cross-domain image-guided composition without requiring additional training, finetuning, or optimization. Our experiments show that equipping Stable Diffusion with the exceptional prompt outperforms state-of-the-art inversion methods on various datasets.
arXiv Detail & Related papers (2023-07-24T02:50:44Z)
Cones 2: Customizable Image Synthesis with Multiple Subjects [50.54010141032032]
We study how to efficiently represent a particular subject as well as how to appropriately compose different subjects. By rectifying the activations in the cross-attention map, the layout appoints and separates the location of different subjects in the image.
arXiv Detail & Related papers (2023-05-30T18:00:06Z)
Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition. We propose augmenting the input image with masks that indicate the presence of target concepts. We then present a novel two-phase customization process.
arXiv Detail & Related papers (2023-05-25T17:59:04Z)
Cross-domain Compositing with Pretrained Diffusion Models [34.98199766006208]
We employ a localized, iterative refinement scheme which infuses the injected objects with contextual information derived from the background scene. Our method produces higher quality and realistic results without requiring any annotations or training.
arXiv Detail & Related papers (2023-02-20T18:54:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.