Are Diffusion Models Vision-And-Language Reasoners?
- URL: http://arxiv.org/abs/2305.16397v3
- Date: Thu, 2 Nov 2023 18:36:09 GMT
- Title: Are Diffusion Models Vision-And-Language Reasoners?
- Authors: Benno Krojer, Elinor Poole-Dayan, Vikram Voleti, Christopher Pal, Siva
Reddy
- Abstract summary: We transform diffusion-based models for any image-text matching (ITM) task using a novel method called DiffusionITM.
We introduce the Generative-Discriminative Evaluation Benchmark (GDBench) with 7 complex vision-and-language tasks, bias evaluation and detailed analysis.
We find that Stable Diffusion + DiffusionITM is competitive on many tasks and outperforms CLIP on compositional tasks like CLEVR and Winoground.
- Score: 30.579483430697803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-conditioned image generation models have recently shown immense
qualitative success using denoising diffusion processes. However, unlike
discriminative vision-and-language models, it is a non-trivial task to subject
these diffusion-based generative models to automatic fine-grained quantitative
evaluation of high-level phenomena such as compositionality. Towards this goal,
we perform two innovations. First, we transform diffusion-based models (in our
case, Stable Diffusion) for any image-text matching (ITM) task using a novel
method called DiffusionITM. Second, we introduce the Generative-Discriminative
Evaluation Benchmark (GDBench) with 7 complex vision-and-language tasks, bias
evaluation and detailed analysis. We find that Stable Diffusion + DiffusionITM is
competitive on many tasks and outperforms CLIP on compositional tasks like CLEVR
and Winoground. We further boost its compositional
performance with a transfer setup by fine-tuning on MS-COCO while retaining
generative capabilities. We also measure the stereotypical bias in diffusion
models, and find that Stable Diffusion 2.1 is, for the most part, less biased
than Stable Diffusion 1.5. Overall, our results point in an exciting direction
bringing discriminative and generative model evaluation closer. We will release
code and benchmark setup soon.
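The core DiffusionITM idea of repurposing a text-to-image diffusion model as an image-text matcher can be illustrated by ranking candidate captions for an image by the average denoising error of the noised image latent conditioned on each caption, and picking the caption with the lowest error. Below is a minimal sketch of that scoring loop, assuming the Hugging Face diffusers Stable Diffusion interface; the model id, the number of sampled timesteps, and the function name itm_score are illustrative choices, not details taken from the paper.

# Hedged sketch: score an image-text pair by the diffusion model's denoising error.
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

@torch.no_grad()
def itm_score(image: torch.Tensor, caption: str, n_samples: int = 8) -> float:
    """Average denoising error of the noised image latent conditioned on `caption`.
    Lower = better image-text match. `image` is [1, 3, 512, 512] with values in [-1, 1]."""
    # Encode the image into the VAE latent space.
    latents = pipe.vae.encode(image.to(device)).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor

    # Encode the candidate caption with the CLIP text encoder.
    tokens = pipe.tokenizer(caption, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt").to(device)
    text_emb = pipe.text_encoder(tokens.input_ids)[0]

    errors = []
    for _ in range(n_samples):
        # Sample a random timestep, noise the latent, and measure how well the
        # UNet predicts the added noise when conditioned on the caption.
        t = torch.randint(0, pipe.scheduler.config.num_train_timesteps, (1,), device=device)
        noise = torch.randn_like(latents)
        noisy = pipe.scheduler.add_noise(latents, noise, t)
        pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
        errors.append(torch.nn.functional.mse_loss(pred, noise).item())
    return sum(errors) / len(errors)

# Usage: pick the caption whose conditional denoising error is smallest.
# best_caption = min(candidate_captions, key=lambda c: itm_score(image, c))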
Related papers
- Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization [68.69203905664524]
We introduce Diffusion-RPO, a new method designed to align diffusion-based T2I models with human preferences more effectively.
We have developed a new evaluation metric, style alignment, aimed at overcoming the challenges of high costs and low interpretability.
Our findings demonstrate that Diffusion-RPO outperforms established methods such as Supervised Fine-Tuning and Diffusion-DPO in tuning Stable Diffusion versions 1.5 and XL-1.0.
arXiv Detail & Related papers (2024-06-10T15:42:03Z)
- Aligning Diffusion Models by Optimizing Human Utility [1.6166249658374658]
Diffusion-KTO is a novel approach for aligning text-to-image diffusion models with human preferences.
Our objective requires simple per-image binary feedback signals, e.g. likes or dislikes, which are abundantly available.
arXiv Detail & Related papers (2024-04-06T01:23:23Z)
- Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach.
It suffers from the drawback of slow generation speed due to the requirement of a large number of time steps.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z)
- An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA) against diffusion models.
Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models.
To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
arXiv Detail & Related papers (2023-05-26T16:38:48Z)
- Your Diffusion Model is Secretly a Zero-Shot Classifier [90.40799216880342]
We show that density estimates from large-scale text-to-image diffusion models can be leveraged to perform zero-shot classification.
Our generative approach to classification attains strong results on a variety of benchmarks.
Our results are a step toward using generative over discriminative models for downstream tasks.
arXiv Detail & Related papers (2023-03-28T17:59:56Z)
- DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models [81.84866217721361]
DiffusionBERT is a new generative masked language model based on discrete diffusion models.
We propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step.
Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text.
arXiv Detail & Related papers (2022-11-28T03:25:49Z)