Related papers: Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion

URL: http://arxiv.org/abs/2402.16305v1
Date: Mon, 26 Feb 2024 05:08:40 GMT
Title: Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
Authors: Xuantong Liu, Tianyang Hu, Wenjia Wang, Kenji Kawaguchi, Yuan Yao
Abstract summary: Diffusion Probabilistic Models (DPMs) are a dominant force in text-to-image generation tasks. We propose an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment.
Score: 35.21106030549071
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As a dominant force in text-to-image generation tasks, Diffusion Probabilistic Models (DPMs) face a critical challenge in controllability, struggling to adhere strictly to complex, multi-faceted instructions. In this work, we aim to address this alignment challenge for conditional generation tasks. First, we provide an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). With this formulation, we naturally propose a training-free approach that bypasses the conventional sampling process associated with DPMs. By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment. As proof of concept, we demonstrate the pipeline with the pre-trained BLIP-2 model and identify several key designs for improved image generation. To further enhance the image fidelity, a Score Distillation Sampling module of Stable Diffusion is incorporated. By carefully balancing the two components during optimization, our method can produce high-quality images with near state-of-the-art performance on T2I-Compbench.

Related papers

D2C: Unlocking the Potential of Continuous Autoregressive Image Generation with Discrete Tokens [80.75893450536577]
We propose D2C, a novel two-stage method to enhance model generation capacity. In the first stage, the discrete-valued tokens representing coarse-grained image features are sampled by employing a small discrete-valued generator. In the second stage, the continuous-valued tokens representing fine-grained image features are learned conditioned on the discrete token sequence.
arXiv Detail & Related papers (2025-03-21T13:58:49Z)
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [77.86514804787622]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
OminiControl: Minimal and Universal Control for Diffusion Transformer [68.3243031301164]
OminiControl is a framework that integrates image conditions into pre-trained Diffusion Transformer (DiT) models. At its core, OminiControl leverages a parameter reuse mechanism, enabling the DiT to encode image conditions using itself as a powerful backbone. OminiControl addresses a wide range of image conditioning tasks in a unified manner, including subject-driven generation and spatially-aligned conditions.
arXiv Detail & Related papers (2024-11-22T17:55:15Z)
Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating advanced diffusion models (DMs). Existing binarization methods result in significant performance degradation. We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
arXiv Detail & Related papers (2024-06-09T10:30:25Z)
Controllable Image Generation With Composed Parallel Token Prediction [5.107886283951882]
Compositional image generation requires models to generalise well in situations where two or more input concepts do not necessarily appear together in training. We propose a formulation for controllable conditional generation of images via composing the log-probability outputs of discrete generative models of the latent space.
arXiv Detail & Related papers (2024-05-10T15:27:35Z)
DivCon: Divide and Conquer for Progressive Text-to-Image Generation [0.0]
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements. Layout is employed as an intermedium to bridge large language models and layout-based diffusion models. We introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks.
arXiv Detail & Related papers (2024-03-11T03:24:44Z)
Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive [21.49096276631859]
Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. We propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout.
arXiv Detail & Related papers (2024-01-16T20:31:46Z)
Image Inpainting via Tractable Steering of Diffusion Models [54.13818673257381]
This paper proposes to exploit the ability of Tractable Probabilistic Models (TPMs) to exactly and efficiently compute the constrained posterior. Specifically, this paper adopts a class of expressive TPMs termed Probabilistic Circuits (PCs). We show that our approach can consistently improve the overall quality and semantic coherence of inpainted images with only 10% additional computational overhead.
arXiv Detail & Related papers (2023-11-28T21:14:02Z)
Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation [15.411325887412413]
This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM). FSA-CDM introduces contrastive positive/negative samples into the diffusion model to boost performance for markup-to-image generation. Experiments are conducted on four benchmark datasets from different domains.
arXiv Detail & Related papers (2023-08-02T13:43:03Z)
Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
StraIT: Non-autoregressive Generation with Stratified Image Transformer [63.158996766036736]
Stratified Image Transformer (StraIT) is a pure non-autoregressive (NAR) generative model. Our experiments demonstrate that StraIT significantly improves NAR generation and outperforms existing DMs and AR methods.
arXiv Detail & Related papers (2023-03-01T18:59:33Z)
CDPMSR: Conditional Diffusion Probabilistic Models for Single Image Super-Resolution [91.56337748920662]
Diffusion probabilistic models (DPM) have been widely adopted in image-to-image translation. We propose a simple but non-trivial DPM-based super-resolution post-process framework, i.e., cDPMSR. Our method surpasses prior attempts on both qualitative and quantitative results.
arXiv Detail & Related papers (2023-02-14T15:13:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.