Zero-Shot Image Harmonization with Generative Model Prior
- URL: http://arxiv.org/abs/2307.08182v2
- Date: Mon, 11 Mar 2024 14:08:04 GMT
- Title: Zero-Shot Image Harmonization with Generative Model Prior
- Authors: Jianqi Chen, Yilan Zhang, Zhengxia Zou, Keyan Chen, Zhenwei Shi
- Abstract summary: We propose a zero-shot approach to image harmonization, aiming to overcome the reliance on large amounts of synthetic composite images.
We introduce a fully modularized framework inspired by human behavior.
We present compelling visual results across diverse scenes and objects, along with a user study validating our approach.
- Score: 22.984119094424056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a zero-shot approach to image harmonization, aiming to overcome
the reliance on large amounts of synthetic composite images in existing
methods. These methods, while showing promising results, involve significant
training expenses and often struggle with generalization to unseen images. To
this end, we introduce a fully modularized framework inspired by human
behavior. Leveraging the reasoning capabilities of recent foundation models in
language and vision, our approach comprises three main stages. Initially, we
employ a pretrained vision-language model (VLM) to generate descriptions for
the composite image. Subsequently, these descriptions guide the foreground
harmonization direction of a text-to-image generative model (T2I). We refine
text embeddings for enhanced representation of imaging conditions and employ
self-attention and edge maps for structure preservation. Following each
harmonization iteration, an evaluator determines whether to conclude or modify
the harmonization direction. The resulting framework, mirroring human behavior,
achieves harmonious results without the need for extensive training. We present
compelling visual results across diverse scenes and objects, along with a user
study validating the effectiveness of our approach.
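To make the three-stage loop concrete, below is a minimal Python sketch of the workflow the abstract describes. Every helper is a toy stand-in for a pretrained model (a VLM captioner, a guided T2I diffusion step, an evaluator); the names are hypothetical and do not come from the authors' code.
```python
# A minimal sketch of the three-stage zero-shot harmonization loop.
# All helpers are hypothetical stand-ins for pretrained models.

def vlm_describe(image):
    # Stage 1 stand-in: a vision-language model describes the composite,
    # yielding a target description of the imaging conditions.
    return "a photo taken at golden hour with warm, soft lighting"

def t2i_harmonize_step(image, mask, description):
    # Stage 2 stand-in: one guided diffusion step that edits only the
    # masked foreground toward the description, preserving structure via
    # self-attention and edge-map constraints (per the abstract).
    return image  # a real step would return the updated image

def evaluate(image, description):
    # Stage 3 stand-in: the evaluator decides whether to stop or to
    # modify the harmonization direction (i.e., the description).
    return "done", description

def harmonize(composite, mask, max_iters=5):
    description = vlm_describe(composite)
    image = composite
    for _ in range(max_iters):
        image = t2i_harmonize_step(image, mask, description)
        verdict, description = evaluate(image, description)
        if verdict == "done":
            break
    return image

print(harmonize("composite-pixels", "foreground-mask"))
```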
Related papers
- Causal Image Modeling for Efficient Visual Understanding [41.87857129429512]
We introduce the Adventurer series of models, in which we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations.
This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length.
In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework.
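As a rough illustration of this modeling paradigm, the sketch below flattens an image into raster-ordered patch tokens and runs a uni-directional sequence model over them. A GRU stands in for the recurrent formulation; the actual Adventurer architecture is not specified here.
```python
import torch
import torch.nn as nn

class CausalPatchEncoder(nn.Module):
    """Toy sketch: image -> raster-ordered patch tokens -> causal
    sequence model. Each token attends only to its past, and the
    recurrence is O(N) in the sequence length."""

    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(3 * patch * patch, dim)  # patch -> token
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # uni-directional

    def forward(self, x):                 # x: (B, 3, H, W)
        B, C, H, W = x.shape
        p = self.patch
        # (B, 3, H, W) -> (B, N, 3*p*p) patch tokens in raster-scan order
        tokens = (x.unfold(2, p, p).unfold(3, p, p)   # (B, 3, H/p, W/p, p, p)
                   .permute(0, 2, 3, 1, 4, 5)
                   .reshape(B, -1, C * p * p))
        h, _ = self.rnn(self.embed(tokens))  # each token sees only its past
        return h[:, -1]                      # last state as the representation

x = torch.randn(2, 3, 224, 224)
print(CausalPatchEncoder()(x).shape)  # torch.Size([2, 256])
```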
arXiv Detail & Related papers (2024-10-10T04:14:52Z) - Information Theoretic Text-to-Image Alignment [49.396917351264655]
We present a novel method that relies on an information-theoretic alignment measure to steer image generation.
Our method is on par with or superior to the state of the art, yet requires nothing but a pre-trained denoising network to estimate mutual information (MI).
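One way such an estimator can work is sketched below, under loud assumptions: score a caption by the gap between unconditional and conditional denoising errors of a pretrained denoiser, a quantity related to point-wise MI. The noise schedule and estimator details here are illustrative, not the paper's exact formulation.
```python
import torch

def pointwise_mi_estimate(x0, caption_emb, eps_model, n_samples=32, T=1000):
    # Average, over noise levels, the gap between unconditional and
    # conditional denoising errors; a larger gap means the caption
    # carries more information about the image.
    gaps = []
    for _ in range(n_samples):
        t = torch.randint(1, T, (1,))
        alpha_bar = torch.cos(t / T * torch.pi / 2) ** 2  # toy cosine schedule
        eps = torch.randn_like(x0)
        x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps
        err_uncond = (eps - eps_model(x_t, t, None)).pow(2).sum()
        err_cond = (eps - eps_model(x_t, t, caption_emb)).pow(2).sum()
        gaps.append(err_uncond - err_cond)
    return torch.stack(gaps).mean()

# Smoke test with a trivial stand-in denoiser (the gap is zero by construction).
toy = lambda x_t, t, cond: torch.zeros_like(x_t)
print(pointwise_mi_estimate(torch.randn(3, 64, 64), None, toy))
```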
arXiv Detail & Related papers (2024-05-31T12:20:02Z) - Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, on VisionPrefer to guide the training of text-to-image generative models; its preference prediction accuracy is comparable to that of human annotators.
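The sketch below shows the general pattern of steering a generator with a learned preference reward, using a toy generator and a toy stand-in for VP-Score. The REINFORCE-style update is a generic choice for illustration, not necessarily the paper's training recipe.
```python
import torch

class ToyGenerator(torch.nn.Module):
    # Stand-in for a text-to-image generator: samples "images" (vectors)
    # and returns their log-probabilities under the model.
    def __init__(self):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(16))

    def sample(self, n):
        dist = torch.distributions.Normal(self.mu, 1.0)
        imgs = dist.sample((n,))
        return imgs, dist.log_prob(imgs).sum(-1)

def vp_score(imgs):                          # stand-in preference reward
    return -(imgs - 1.0).pow(2).mean(-1)     # prefers samples near 1.0

gen = ToyGenerator()
opt = torch.optim.Adam(gen.parameters(), lr=0.1)
for step in range(200):
    imgs, logp = gen.sample(64)
    with torch.no_grad():
        r = vp_score(imgs)
        adv = r - r.mean()                   # baseline-subtracted reward
    loss = -(adv * logp).mean()              # REINFORCE-style update
    opt.zero_grad()
    loss.backward()
    opt.step()
print(gen.mu.mean().item())                  # drifts toward the preferred 1.0
```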
arXiv Detail & Related papers (2024-04-23T14:53:15Z) - DiffHarmony: Latent Diffusion Model Meets Image Harmonization [11.500358677234939]
Diffusion models have promoted the rapid development of image-to-image translation tasks.
Training diffusion models from scratch is computationally intensive, while fine-tuning pre-trained latent diffusion models must contend with the reconstruction error of the latent autoencoder.
In this paper, we adapt a pre-trained latent diffusion model to the image harmonization task to generate harmonious but potentially blurry initial images.
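A common way to adapt a pretrained latent diffusion model to a conditional task like harmonization is to concatenate the condition (composite latent and mask) onto the denoiser's input channels; the toy module below illustrates that pattern, though the paper's exact conditioning may differ.
```python
import torch
import torch.nn as nn

class HarmonyDenoiser(nn.Module):
    """Toy illustration: condition the denoiser on the composite image's
    latent and the foreground mask by channel concatenation. The small
    conv stack stands in for a pretrained UNet."""

    def __init__(self, zdim=4):
        super().__init__()
        # input = noisy latent + composite latent + downsampled mask
        self.net = nn.Sequential(
            nn.Conv2d(2 * zdim + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, zdim, 3, padding=1),
        )

    def forward(self, z_noisy, z_composite, mask):
        return self.net(torch.cat([z_noisy, z_composite, mask], dim=1))

z = torch.randn(1, 4, 32, 32)              # noisy latent at some timestep
z_comp = torch.randn(1, 4, 32, 32)         # encoded composite image
m = torch.zeros(1, 1, 32, 32)
m[..., 8:24, 8:24] = 1.0                   # foreground mask
print(HarmonyDenoiser()(z, z_comp, m).shape)  # torch.Size([1, 4, 32, 32])
# Decoding the harmonized latent through the VAE can lose high-frequency
# detail, which is why the initial result may be blurry and need refinement.
```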
arXiv Detail & Related papers (2024-04-09T09:05:23Z) - Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs [57.492124844326206]
This work delves into the task of pose-free novel view synthesis from stereo pairs, a challenging and pioneering task in 3D vision.
Our framework integrates 2D correspondence matching, camera pose estimation, and NeRF rendering, so that each task reinforces the others.
arXiv Detail & Related papers (2023-12-12T13:22:44Z) - Integrating View Conditions for Image Synthesis [14.738884513493227]
This paper introduces a framework that integrates viewpoint information to enhance control over image editing tasks.
We distill three essential criteria (consistency, controllability, and harmony) that an image editing method should meet.
arXiv Detail & Related papers (2023-10-24T16:55:07Z) - Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models [13.019535928387702]
This paper presents Progressive Conditional Diffusion Models (PCDMs) that incrementally bridge the gap between person images under the target and source poses through three stages.
Both qualitative and quantitative results demonstrate the consistency and photorealism of our proposed PCDMs under challenging scenarios.
arXiv Detail & Related papers (2023-10-10T05:13:17Z) - Language-free Compositional Action Generation via Decoupling Refinement [67.50452446686725]
We introduce a novel framework to generate compositional actions without reliance on language auxiliaries.
Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement.
arXiv Detail & Related papers (2023-07-07T12:00:38Z) - SSH: A Self-Supervised Framework for Image Harmonization [97.16345684998788]
We propose a novel Self-Supervised Harmonization framework (SSH) that can be trained using only unedited, freely available natural images.
Our results show that the proposed SSH outperforms previous state-of-the-art methods in terms of reference metrics, visual quality, and a subjective user study.
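The core self-supervision trick can be sketched as follows: fabricate a (composite, target) pair from one unedited image by perturbing the appearance of a random region, then train a network to undo the perturbation. The brightness/color jitter below is a simplification of the paper's augmentations.
```python
import torch

def make_ssh_pair(image):
    # Synthesize a training pair from a single natural image: jitter the
    # appearance of a random rectangle to fake a pasted foreground.
    _, H, W = image.shape
    h, w = H // 2, W // 2
    y = torch.randint(0, H - h, (1,)).item()
    x = torch.randint(0, W - w, (1,)).item()
    mask = torch.zeros(1, H, W)
    mask[:, y:y + h, x:x + w] = 1.0
    gain = torch.empty(3, 1, 1).uniform_(0.6, 1.4)   # per-channel color gain
    composite = image * (1 - mask) + (image * gain).clamp(0, 1) * mask
    # (composite, mask) is the input; the original image is the target the
    # harmonization network learns to recover.
    return composite, mask, image

img = torch.rand(3, 128, 128)
comp, m, target = make_ssh_pair(img)
print(comp.shape, m.shape, target.shape)
```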
arXiv Detail & Related papers (2021-08-15T19:51:33Z) - IMAGINE: Image Synthesis by Image-Guided Model Inversion [79.4691654458141]
We introduce an inversion-based method, denoted IMAge-Guided model INvErsion (IMAGINE), to generate high-quality and diverse images.
We leverage the knowledge of image semantics from a pre-trained classifier to achieve plausible generations.
IMAGINE enables the synthesis procedure to simultaneously 1) enforce semantic specificity constraints during the synthesis, 2) produce realistic images without generator training, and 3) give users intuitive control over the generation process.
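In the same spirit, a minimal generator-free inversion loop optimizes pixels directly so that a pretrained network's features match a reference image, under a smoothness prior. The feature network and loss weights below are toy stand-ins, not IMAGINE's actual objective.
```python
import torch
import torch.nn as nn

def image_guided_inversion(reference, feat_net, steps=200, lr=0.05):
    # Optimize an image so its features match the reference's features,
    # with a total-variation smoothness prior; no generator is trained.
    x = torch.rand_like(reference, requires_grad=True)
    target = feat_net(reference.unsqueeze(0)).detach()
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        feat = feat_net(x.unsqueeze(0))
        match = (feat - target).pow(2).mean()            # semantic match
        tv = (x[:, 1:, :] - x[:, :-1, :]).abs().mean() \
           + (x[:, :, 1:] - x[:, :, :-1]).abs().mean()   # smoothness prior
        loss = match + 0.1 * tv
        opt.zero_grad()
        loss.backward()
        opt.step()
        x.data.clamp_(0, 1)
    return x.detach()

toy_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
ref = torch.rand(3, 64, 64)
print(image_guided_inversion(ref, toy_net, steps=20).shape)  # (3, 64, 64)
```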
arXiv Detail & Related papers (2021-04-13T02:00:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.