Referee Can Play: An Alternative Approach to Conditional Generation via
Model Inversion
- URL: http://arxiv.org/abs/2402.16305v1
- Date: Mon, 26 Feb 2024 05:08:40 GMT
- Title: Referee Can Play: An Alternative Approach to Conditional Generation via
Model Inversion
- Authors: Xuantong Liu, Tianyang Hu, Wenjia Wang, Kenji Kawaguchi, Yuan Yao
- Abstract summary: Diffusion Probabilistic Models (DPMs) are a dominant force in text-to-image generation tasks.
We propose an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs).
By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment.
- Score: 35.21106030549071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As a dominant force in text-to-image generation tasks, Diffusion
Probabilistic Models (DPMs) face a critical challenge in controllability,
struggling to adhere strictly to complex, multi-faceted instructions. In this
work, we aim to address this alignment challenge for conditional generation
tasks. First, we provide an alternative view of state-of-the-art DPMs as a way
of inverting advanced Vision-Language Models (VLMs). With this formulation, we
naturally propose a training-free approach that bypasses the conventional
sampling process associated with DPMs. By directly optimizing images with the
supervision of discriminative VLMs, the proposed method can potentially achieve
a better text-image alignment. As proof of concept, we demonstrate the pipeline
with the pre-trained BLIP-2 model and identify several key designs for improved
image generation. To further enhance the image fidelity, a Score Distillation
Sampling module of Stable Diffusion is incorporated. By carefully balancing the
two components during optimization, our method can produce high-quality images
with near state-of-the-art performance on T2I-Compbench.
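The core idea in the abstract, optimizing the image directly under the supervision of a discriminative model, can be illustrated with a toy sketch. This is not the authors' pipeline: a fixed random linear map `W` stands in for the VLM image encoder (the paper uses BLIP-2), cosine similarity stands in for the alignment score, and the Score Distillation Sampling term is omitted. The function name `invert_referee` and all of its parameters are hypothetical.

```python
import numpy as np

def invert_referee(W, text_emb, steps=1000, lr=0.5, seed=0):
    """Toy model inversion: gradient ascent on an 'image' vector x so that
    its embedding W @ x aligns with text_emb (cosine similarity).

    W is a stand-in for a frozen image encoder (assumption, not BLIP-2).
    Returns the optimized x and the per-step alignment scores.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=W.shape[1])           # start from random noise
    t = text_emb / np.linalg.norm(text_emb)   # unit-norm target embedding
    history = []
    for _ in range(steps):
        z = W @ x                             # "image" embedding
        nz = np.linalg.norm(z)
        score = float(t @ z / nz)             # cosine similarity, the referee's score
        history.append(score)
        # Gradient of cos(z, t) with respect to z, then chain rule through W.
        grad_z = (t - score * z / nz) / nz
        x = x + lr * (W.T @ grad_z)           # ascent step on the image itself
    return x, history

# Usage: a frozen 32 -> 8 "encoder" and a random target "text" embedding.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 32)) / np.sqrt(32)
text_emb = rng.normal(size=8)
x_opt, scores = invert_referee(W, text_emb)
print(f"alignment before: {scores[0]:.3f}, after: {scores[-1]:.3f}")
```

Only the input is updated while the encoder stays frozen, which is what makes the approach training-free. In the paper's setting, a fidelity term from Stable Diffusion is balanced against the alignment term during optimization.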
Related papers
- Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
For the model structure, we design a UNet architecture optimized for binarization.
We propose consistent-pixel-downsample (CP-Down) and consistent-pixel-upsample (CP-Up) to keep dimensions consistent.
Comprehensive experiments demonstrate that our BI-DiffSR outperforms existing binarization methods.
arXiv Detail & Related papers (2024-06-09T10:30:25Z) - Controllable Image Generation With Composed Parallel Token Prediction [5.107886283951882]
Compositional image generation requires models to generalise well in situations where two or more input concepts do not necessarily appear together in training.
We propose a formulation for controllable conditional generation of images via composing the log-probability outputs of discrete generative models of the latent space.
arXiv Detail & Related papers (2024-05-10T15:27:35Z) - Bidirectional Consistency Models [1.486435467709869]
Diffusion models (DMs) are capable of generating remarkably high-quality samples by iteratively denoising a random vector.
DMs can also invert an input image to noise by moving backward along the probability flow ordinary differential equation (PF ODE).
arXiv Detail & Related papers (2024-03-26T18:40:36Z) - DivCon: Divide and Conquer for Progressive Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements.
We introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks.
Our approach significantly improves the controllability and consistency in generating multiple objects from complex textual prompts.
arXiv Detail & Related papers (2024-03-11T03:24:44Z) - Image Inpainting via Tractable Steering of Diffusion Models [54.13818673257381]
This paper proposes to exploit the ability of Tractable Probabilistic Models (TPMs) to exactly and efficiently compute the constrained posterior.
Specifically, this paper adopts a class of expressive TPMs termed Probabilistic Circuits (PCs).
We show that our approach can consistently improve the overall quality and semantic coherence of inpainted images with only 10% additional computational overhead.
arXiv Detail & Related papers (2023-11-28T21:14:02Z) - Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment
for Markup-to-Image Generation [15.411325887412413]
This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM).
FSA-CDM introduces contrastive positive/negative samples into the diffusion model to boost performance for markup-to-image generation.
Experiments are conducted on four benchmark datasets from different domains.
arXiv Detail & Related papers (2023-08-02T13:43:03Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - StraIT: Non-autoregressive Generation with Stratified Image Transformer [63.158996766036736]
Stratified Image Transformer (StraIT) is a pure non-autoregressive (NAR) generative model.
Our experiments demonstrate that StraIT significantly improves NAR generation and out-performs existing DMs and AR methods.
arXiv Detail & Related papers (2023-03-01T18:59:33Z) - CDPMSR: Conditional Diffusion Probabilistic Models for Single Image
Super-Resolution [91.56337748920662]
Diffusion probabilistic models (DPM) have been widely adopted in image-to-image translation.
We propose a simple but non-trivial DPM-based super-resolution post-process framework, i.e., cDPMSR.
Our method surpasses prior attempts on both qualitative and quantitative results.
arXiv Detail & Related papers (2023-02-14T15:13:33Z) - BBDM: Image-to-image Translation with Brownian Bridge Diffusion Models [50.39417112077254]
A novel image-to-image translation method based on the Brownian Bridge Diffusion Model (BBDM) is proposed.
To the best of our knowledge, it is the first work that proposes a Brownian Bridge diffusion process for image-to-image translation.
arXiv Detail & Related papers (2022-05-16T13:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.