Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
- URL: http://arxiv.org/abs/2402.16305v1
- Date: Mon, 26 Feb 2024 05:08:40 GMT
- Title: Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
- Authors: Xuantong Liu, Tianyang Hu, Wenjia Wang, Kenji Kawaguchi, Yuan Yao
- Abstract summary: Diffusion Probabilistic Models (DPMs) are a dominant force in text-to-image generation tasks.
We propose an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs).
By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment.
- Score: 35.21106030549071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As a dominant force in text-to-image generation tasks, Diffusion
Probabilistic Models (DPMs) face a critical challenge in controllability,
struggling to adhere strictly to complex, multi-faceted instructions. In this
work, we aim to address this alignment challenge for conditional generation
tasks. First, we provide an alternative view of state-of-the-art DPMs as a way
of inverting advanced Vision-Language Models (VLMs). With this formulation, we
naturally propose a training-free approach that bypasses the conventional
sampling process associated with DPMs. By directly optimizing images with the
supervision of discriminative VLMs, the proposed method can potentially achieve
a better text-image alignment. As proof of concept, we demonstrate the pipeline
with the pre-trained BLIP-2 model and identify several key designs for improved
image generation. To further enhance the image fidelity, a Score Distillation
Sampling module of Stable Diffusion is incorporated. By carefully balancing the
two components during optimization, our method can produce high-quality images
with near state-of-the-art performance on T2I-CompBench.
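The core idea above lends itself to a compact sketch. Below is a minimal PyTorch version of the training-free loop, assuming `vlm_alignment_loss` (e.g., a BLIP-2 image-text matching score) and `sds_loss` (a Stable Diffusion Score Distillation Sampling term) are provided as callables; both names are hypothetical stand-ins, not the authors' API:

```python
# Minimal sketch (not the authors' code): optimize the image directly under a
# discriminative VLM alignment loss plus an SDS fidelity term, instead of
# running the conventional DPM sampling chain.
import torch

def generate(prompt, vlm_alignment_loss, sds_loss,
             steps=500, lambda_vlm=1.0, lambda_sds=0.1, lr=0.05):
    # The image (or a latent) is the optimization variable.
    image = torch.randn(1, 3, 512, 512, requires_grad=True)
    opt = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Text-image alignment signal from the discriminative VLM (the "referee").
        align = vlm_alignment_loss(image, prompt)
        # Score Distillation Sampling keeps the image on the natural-image manifold.
        fidelity = sds_loss(image, prompt)
        # Weighting the two terms is the balancing act the abstract mentions.
        (lambda_vlm * align + lambda_sds * fidelity).backward()
        opt.step()
    return image.detach().clamp(-1, 1)
```

The relative weighting (`lambda_vlm` vs. `lambda_sds`) encodes the careful balancing described in the abstract: the VLM term drives text-image alignment, while the SDS term drives image fidelity.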
Related papers
- OminiControl: Minimal and Universal Control for Diffusion Transformer [68.3243031301164]
OminiControl is a framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models.
At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone; a toy sketch of this idea follows this entry.
OminiControl addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions.
arXiv Detail & Related papers (2024-11-22T17:55:15Z)
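A hypothetical illustration of the parameter-reuse idea summarized above: the same pretrained DiT blocks process noisy-image tokens and condition-image tokens as one concatenated sequence, so no separate condition encoder is needed. `DiTBlock` here is a bare-bones stand-in, not OminiControl's actual module:

```python
# Illustrative only: reuse one set of transformer blocks for both the noisy
# image tokens and the condition image tokens via joint attention.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

def forward_with_condition(blocks, noisy_tokens, cond_tokens):
    # Concatenate both token streams; joint attention lets image tokens
    # attend to condition tokens without any extra encoder weights.
    x = torch.cat([noisy_tokens, cond_tokens], dim=1)
    for blk in blocks:
        x = blk(x)
    return x[:, :noisy_tokens.shape[1]]  # keep only the denoised-image tokens
```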
- Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating advanced diffusion models (DMs).
Existing binarization methods result in significant performance degradation.
We introduce a novel binarized diffusion model, BI-DiffSR, for image SR; a generic weight-binarization sketch follows this entry.
arXiv Detail & Related papers (2024-06-09T10:30:25Z)
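For context on the summary above, here is a generic XNOR-Net-style weight binarization, which reduces each weight to a sign plus one per-channel scale; BI-DiffSR's actual binarized modules are more elaborate than this sketch:

```python
# Generic weight binarization (illustrative only): w ≈ alpha * sign(w), with a
# per-output-channel scale alpha = mean(|w|). Training would additionally need
# a straight-through estimator, since sign() has zero gradient almost everywhere.
import torch

def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    alpha = w.abs().mean(dim=tuple(range(1, w.dim())), keepdim=True)
    return alpha * torch.sign(w)
```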
- Controllable Image Generation With Composed Parallel Token Prediction [5.107886283951882]
Compositional image generation requires models to generalise well in situations where two or more input concepts do not necessarily appear together in training.
We propose a formulation for controllable conditional generation of images via composing the log-probability outputs of discrete generative models of the latent space; a sketch of this composition follows this entry.
arXiv Detail & Related papers (2024-05-10T15:27:35Z)
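A minimal sketch of composing log-probability outputs, as the summary above describes: per-concept token distributions are combined additively in log space. The setup (`logits_per_condition` as one logits tensor per input concept) is an assumption for illustration, not taken from the paper:

```python
# Illustrative composition of discrete generative models: the combined
# distribution satisfies log p(token | c1..cK) ∝ sum_k w_k * log p(token | c_k).
import torch

def compose_token_distribution(logits_per_condition, weights=None):
    weights = weights or [1.0] * len(logits_per_condition)
    log_probs = [w * torch.log_softmax(l, dim=-1)
                 for w, l in zip(weights, logits_per_condition)]
    combined = torch.stack(log_probs).sum(dim=0)
    # Categorical renormalizes the combined logits into a valid distribution.
    return torch.distributions.Categorical(logits=combined)
```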
- DivCon: Divide and Conquer for Progressive Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements.
Layout is employed as an intermediary to bridge large language models and layout-based diffusion models.
We introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks.
arXiv Detail & Related papers (2024-03-11T03:24:44Z)
- Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive [21.49096276631859]
Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout.
We propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM).
Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout; a toy version of this feedback loss follows this entry.
arXiv Detail & Related papers (2024-01-16T20:31:46Z)
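A toy version of the generator-side alignment signal described above, assuming a `seg_discriminator` that maps an image to per-pixel class logits; the full ALDM recipe also trains this discriminator adversarially, which this sketch omits:

```python
# Illustrative only: pixel-level feedback from a segmentation-based
# discriminator, with the input layout (per-pixel class ids) as the target.
import torch
import torch.nn.functional as F

def generator_alignment_loss(seg_discriminator, denoised_image, layout_labels):
    logits = seg_discriminator(denoised_image)     # (B, C, H, W) class logits
    return F.cross_entropy(logits, layout_labels)  # layout_labels: (B, H, W) ids
```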
- Image Inpainting via Tractable Steering of Diffusion Models [54.13818673257381]
This paper proposes to exploit the ability of Tractable Probabilistic Models (TPMs) to exactly and efficiently compute the constrained posterior.
Specifically, this paper adopts a class of expressive TPMs termed Probabilistic Circuits (PCs).
We show that our approach can consistently improve the overall quality and semantic coherence of inpainted images with only 10% additional computational overhead.
arXiv Detail & Related papers (2023-11-28T21:14:02Z)
- Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation [15.411325887412413]
This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM).
FSA-CDM introduces contrastive positive/negative samples into the diffusion model to boost performance for markup-to-image generation; a generic contrastive-loss sketch follows this entry.
Experiments are conducted on four benchmark datasets from different domains.
arXiv Detail & Related papers (2023-08-02T13:43:03Z)
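To make the positive/negative-sample idea above concrete, here is a generic InfoNCE-style contrastive loss; FSA-CDM's fine-grained sequence alignment is considerably more specific than this sketch:

```python
# Generic InfoNCE contrastive loss (illustrative only): pull the anchor toward
# its positive and push it away from K negatives.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.07):
    # anchor, positive: (B, D); negatives: (B, K, D)
    pos = F.cosine_similarity(anchor, positive, dim=-1) / temperature        # (B,)
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / temperature  # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)   # positive is class 0
    labels = torch.zeros(anchor.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```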
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- StraIT: Non-autoregressive Generation with Stratified Image Transformer [63.158996766036736]
Stratified Image Transformer (StraIT) is a pure non-autoregressive (NAR) generative model.
Our experiments demonstrate that StraIT significantly improves NAR generation and outperforms existing DMs and AR methods.
arXiv Detail & Related papers (2023-03-01T18:59:33Z)
- CDPMSR: Conditional Diffusion Probabilistic Models for Single Image Super-Resolution [91.56337748920662]
Diffusion probabilistic models (DPM) have been widely adopted in image-to-image translation.
We propose a simple but non-trivial DPM-based super-resolution post-processing framework, i.e., cDPMSR.
Our method surpasses prior attempts in both qualitative and quantitative results.
arXiv Detail & Related papers (2023-02-14T15:13:33Z)